Energy Consumption Forecasting¶

Name: Tan Wen Tao Bryan
Admin No: P2214449
Class: DAAA/FT/2A/01


Project Objective¶

  • To build a time-series model to forecast the gas consumption, electricity consumption and water consumption in the future and obtain the most accurate prediction

Index¶

  1. Exploratory Data Analysis
    1.1 Data Exploration
    1.2 Data Visualisation
  2. Data Cleaning/Feature Engineering
  3. Model Selection
  4. Model Evaluation
  5. Model Improvement
  6. Summary

References¶

  1. Repsol, 2023. Towards more sustainable household energy consumption. [online]. Available from: https://www.repsol.com/en/energy-and-the-future/future-of-the-world/energy-consumption/index.cshtml [Accessed at 24 Jul 2023].
  2. cre8tiveitsolutions, 2022. Why is energy consumption increasing? [online]. internetgeography. Available from: https://www.internetgeography.net/topics/why-is-energy-consumption-increasing/#:~:text=The%20reasons%20for%20increasing%20energy,rising%20population%20and%20technological%20developments [Accessed at 14 Jul 2023].
  3. US Climate Resilience Toolkit, 2019. Energy Consumption. [online]. Available from: https://toolkit.climate.gov/topics/energy-supply-and-use/energy-consumption [Accessed at 24 Jul 2023].
  4. Vedantu, 2023. Advantages and Limitations of Forecasting. [online]. Available from: https://www.vedantu.com/commerce/advantages-and-limitations-of-forecasting [Accessed at 24 Jul 2023].
  5. Robert, N., 2020. ARIMA models for time series forecasting. [online]. Duke University. Available from: https://people.duke.edu/~rnau/arimrule.htm [Accessed at 10 Aug 2023].
  6. Smigel, L., 2022. How to Interpret ARIMA Results. [online]. Analysing Alpha. Available from: https://analyzingalpha.com/interpret-arima-results [Accessed at 10 Aug 2023].
  7. Maitra, S., 2021. Take Time-Series a Level-Up with Walk-Forward Validation. [online]. Medium. Available from: https://sarit-maitra.medium.com/take-time-series-a-level-up-with-walk-forward-validation-217c33114f68#:~:text=Walk%20Forward%20Validation%20is%20mainly,for%20out%20of%20sample%20testing. [Accessed 10 Aug 2023].
  8. Brendan, A.,2022. Time Series Forecasting with ARIMA , SARIMA and SARIMAX. [online]. Medium. Available from: https://towardsdatascience.com/time-series-forecasting-with-arima-sarima-and-sarimax-ee61099e78f6 [Accessed at 10 Aug 2023].
  9. Lewinson, E., 2020. Choosing the correct error metric: MAPE vs. sMAPE. [online]. Towards Data Science. Available from: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac [Accessed at 10 Aug 2023].
  10. TrainDataHub., 2021. How to Interpret ACF and PACF plots for Identifying AR, MA, ARMA, or ARIMA Models. [online]. Medium. Available from: https://medium.com/@ooemma83/how-to-interpret-acf-and-pacf-plots-for-identifying-ar-ma-arma-or-arima-models-498717e815b6 [Accessed at 10 Aug 2023].
  11. Vang, J., 2023. ARIMA. [online]. One-Off Coder. Available from: https://datascience.oneoffcoder.com/index.html [Accessed at 10 Aug 2023].
  12. Kumar G, V., 2023. Statistical Tests to Check Stationarity in Time Series. [online]. Analytics Vidhya. Available from: https://www.analyticsvidhya.com/blog/2021/06/statistical-tests-to-check-stationarity-in-time-series-part-1/ [Accessed at 10 Aug 2023].
  13. Rob J, H., 2018. Time series components. [online]. Forecasting: Principles and Practice. Available from: https://otexts.com/fpp2/components.html [Accessed at 10 Aug 2023].
  14. Barretto, D., 2022. Time Series Part 2: Forecasting with SARIMAX models: An Intro. [online]. mkbdatalab. Available from: https://www.jadsmkbdatalab.nl/forecasting-with-sarimax-models/ [Accessed at 10 Aug 2023].
  15. Susan, L., 2020. A Quick Introduction On Granger Causality Testing For Time Series Analysis. [online]. Towards Data Science. Available from: https://towardsdatascience.com/a-quick-introduction-on-granger-causality-testing-for-time-series-analysis-7113dc9420d2 [Accessed at 10 Aug 2023].
  16. Stephanie Glen. 2023. "Engle Granger Test" [online] From StatisticsHowTo.com: Elementary Statistics for the rest of us! Available from: https://www.statisticshowto.com/engle-granger-test/ [Accessed at 10 Aug 2023].
  17. Stephanie Glen. 2023. "Johansen’s Test: Simple Definition" [online] From StatisticsHowTo.com: Elementary Statistics for the rest of us! Available from: https://www.statisticshowto.com/johansens-test/ [Accessed at 10 Aug 2023].
  18. Tiwari, A., 2020. Let’s Forecast Your Time Series using Classical Approaches [online] Medium. Available from: https://towardsdatascience.com/lets-forecast-your-time-series-using-classical-approaches-f84eb982212c [Accessed at 10 Aug 2023].
  19. Stack Exchange, 2022. What method can be used to detect seasonality in data? [online]. crossvalidated. Available from: https://stats.stackexchange.com/questions/16117/what-method-can-be-used-to-detect-seasonality-in-data [Accessed at 10 Aug 2023].
  20. Rehal, V., 2022. Impulse Response Function After VAR and VECM. [online]. Spur Economics. Available from: https://spureconomics.com/impulse-response-functions-after-var-and-vecm/ [Accessed at 10 Aug 2023].
  21. Analyttica Datalab, 2021. Understanding Durbin-Watson Test. [online]. Medium. Available from: https://medium.com/@analyttica/durbin-watson-test-fde429f79203 [Accessed at 10 Aug 2023].

sklearn links:

  1. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html

pmdarima links:

  1. https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.ARIMA.html
  2. https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.AutoARIMA.html
  3. https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.ADFTest.html
  4. https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.KPSSTest.html

scipy links:

  1. https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.periodogram.html

statsmodel links:

  1. https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.coint.html
  2. https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.grangercausalitytests.html
  3. https://www.statsmodels.org/dev/vector_ar.html#impulse-response-analysis
  4. https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.varmax.VARMAX.score.html
  5. https://docs.w3cub.com/statsmodels/generated/statsmodels.tsa.statespace.varmax.varmax.impulse_responses

Background Info¶

Forecasting is a method of examining past and current statistics movements and patterns in order to gain some insight or hints about future trends and business movements. Forecasting is looking into the future to help beneficials prepare for it accordingly.

Energy consumption is the total amount of energy required for a given process. This includes the use of electricity, gas, water, oil and biomass.

Various Factors Affecting Rate of Energy Consumption¶

  • Natural Factors:
    • Rising temperature
      • Warmer winters will decrease the need for heating, reducing energy consumption. Hotter summers will increase the demand for cooling.
    • Change in energy sources
      • Mix of energy sources used in each country and state might differ with the increased temperature. Winter heating is powered by a mixture of electricity, fuel oil, and natural gas, whereas summer cooling is powered by electricity. The expected consequence of this situation is reduced demand for fuel oil and natural gas and increased demand for electricity. Using wind and solar to generate electricity to meet this demand may shift consumption of fossil fuels to more renewable sources of energy.
  • Human Factors:
    • Agriculture
      • As a country becomes more developed, there is a higher demand for food, leading to more intensive farming techniques which require additional energy to power machinery, provide lighting and heating. The processing, manufacturing and transport of food also lead to increased energy demands. Commercial farming as a country also increase the consumption of energy. Certain times of the year like special holidays will also have a higher demand of certain agricultural products which means more energy is used to increase the supply.
    • Industry
      • Rapid industralisation leads to the development of manufacturing and processing industries which consume large levels of energy especially due to the rise in technological advancements. This increase the huge demand for many products and services which causes the industries to consume energy to build up the supply to meet the demand.
    • Economical Development
      • As a country GDP increases, more citizens in the country will rise to a higher SES. The demand for energy grows with the increased purchases of domestic appliances, leisure and recreation activities.
    • Global Crisis
      • This includes many crisis that can affect how people consume energy at home and in industries such as: Pandemics, Geopolitical Developments, Wars
    • Political Development
      • Certain countries like Singapore have passed some energy-saving policies that can reduce the consumption of energy like establishing energy efficiency sectors, encouraging the adoption of best practices and energy-saving technologies.

Advantages of Energy Consumption Forecasting:¶

  • Helps in Scheduling: Help to prepare the organisation/authority for the future. Forecasting helps them to prepare of what the future may behold for the society/business which allows them to plan out measures in place for them.
  • Weak Spots Detection: Another benefit of forecasting is that it can help the manager find any weak points that the company may have or overlooked areas. When attention has been drawn to these areas, successful controls and preparation strategies to fix them can be put into practice by the manager.
  • Enhances Coordination and Control: Information and data from a lot of external and internal sources are needed for forecasting. This knowledge is obtained from different internal sources by the various managers and employees. Thus, nearly all of the organization's divisions and verticals are involved in the forecasting process. This facilitates greater cooperation and communication between them.

Importing Libraries¶

In [2]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#statsmodels libraries EDA
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import grangercausalitytests, coint
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.vector_ar.var_model import FEVD

#transform libraries (for EDA)
from sklearn.preprocessing import PowerTransformer
from scipy.signal import periodogram

#cartesion product library (use for joining dataframe columns)
from itertools import product

#time-series models
from pmdarima.arima import auto_arima, ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.statespace.varmax import VARMAX

#cross validation
from sklearn.model_selection import TimeSeriesSplit

#evaluation metrics
from sklearn.metrics import mean_absolute_percentage_error, r2_score, mean_squared_error

Importing Dataset¶

In [3]:
#import energyConsumption_dataset csv file as a dataframe
energyConsumption_df = pd.read_csv('./ST1511_CA2_dataset/Energy Consumption Dataset.csv', sep=",")
display(energyConsumption_df)
DATE Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
0 1/1/1990 18.0 725.1 548.8
1 1/2/1990 15.8 706.7 640.7
2 1/3/1990 17.3 624.5 511.1
3 1/4/1990 18.9 574.7 515.3
4 1/5/1990 22.0 553.2 488.4
... ... ... ... ...
392 1/9/2022 27.7 986.2 513.3
393 1/10/2022 31.8 936.1 373.1
394 1/11/2022 31.0 973.4 343.9
395 1/12/2022 32.4 1147.2 348.3
396 1/1/2023 31.3 1294.0 260.2

397 rows × 4 columns

Observations:

  • For the year 2023, only 1 month (January) of data exist in the dataset
  • Data seems to record the first day of every month

1.1) Data Exploration¶

Nature of the dataset: Contains 397 rows and 4 columns

  • DATE: Date Format is in (dd/mm/yyyy)
  • Gas Consumption (tons): Amount of gas being consumed daily in tons
  • Electricity Consumption (MWh): Amount of electricity being consumed daily in MWh
  • Water Consumption (tons): Amount of water being consumed daily in tons

Descriptive Statistics¶

In [3]:
#Make a copy to prevent mutation
energyConsumption_ds = energyConsumption_df.copy()

#Shape of dataset
print(energyConsumption_ds.shape)
(397, 4)
In [4]:
print(energyConsumption_ds.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   DATE                           397 non-null    object 
 1   Gas Consumption (tons)         397 non-null    float64
 2   Electricity Consumption (MWh)  397 non-null    float64
 3   Water Consumption (tons)       397 non-null    float64
dtypes: float64(3), object(1)
memory usage: 12.5+ KB
None

Observations:

  • 397 rows and 4 columns
  • 1 categorical column, 3 numerical columns
  • All columns has an equal number of observations
  • Date needs to be in date type and not object type
  • No missing values in every row
In [5]:
#Convert date to date type by formatting in the appropriate Date format
energyConsumption_ds["DATE"] = pd.to_datetime(energyConsumption_ds["DATE"], format='%d/%m/%Y')
energyConsumption_ds.set_index('DATE', inplace=True)
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 18.0 725.1 548.8
1990-02-01 15.8 706.7 640.7
1990-03-01 17.3 624.5 511.1
1990-04-01 18.9 574.7 515.3
1990-05-01 22.0 553.2 488.4
In [6]:
#Descriptive Stats
energy_stats=energyConsumption_ds.describe(include="all").T
display(energy_stats)
count mean std min 25% 50% 75% max
Gas Consumption (tons) 397.0 23.785139 4.903452 11.6 20.2 23.5 27.9 46.0
Electricity Consumption (MWh) 397.0 888.472544 153.877594 553.2 771.1 897.8 1005.2 1294.0
Water Consumption (tons) 397.0 484.953652 133.908863 44.4 384.4 487.4 580.2 811.0

Observations:

  • Between 75% and max value for Gas Consumption (tons), there was a huge jump in values compared to the other first 75% of the values, suggesting that there is an outlier in the last 25% of the data

1.2) Data Visualisation¶

Time Series Visualisation¶

In [7]:
#Unique years that the dataset consist of
np.unique(energyConsumption_ds.index.year)
Out[7]:
array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,
       2023], dtype=int64)
In [8]:
#Visualise each column as a time series plot
fig = plt.figure(figsize=(18,13), tight_layout=True)
for i, column in enumerate(energyConsumption_ds.columns):
    ax = fig.add_subplot(len(energyConsumption_ds.columns), 1, i+1)
    energyConsumption_ds[column].plot(ax=ax)
    ax.set_title(f"Time Series Visualisation of {column}", fontsize=16)
    ax.grid(True)
plt.show()

Observations:

  • All these plots show that these columns are a non-stationary series
  • Electricity consumption shows there is a slight increase in mean value, hence not stationary
  • Gas consumption & water consumption shows there is a change in mean and variance value, hence not stationary

Distribution (Univariate Analysis)¶

In [9]:
#Plot density based histograms to show distribution
fig, ax = plt.subplots(1, 3, figsize=(15,6))
sns.histplot(data=energyConsumption_ds, x='Gas Consumption (tons)', kde=True, stat="density", ax=ax[0])
ax[0].set_title("Distribution of Gas Consumption")
ax[0].set_yticks(np.arange(0,0.081, 0.01))

sns.histplot(data=energyConsumption_ds, x='Electricity Consumption (MWh)', kde=True, stat="density", ax=ax[1])
ax[1].set_title("Distribution of Electricity Consumption")

sns.histplot(data=energyConsumption_ds, x='Water Consumption (tons)', kde=True, stat="density", ax=ax[2])
ax[2].set_title("Distribution of Water Consumption")

plt.tight_layout()
plt.show()
In [10]:
#Boxplots to show median and distribution of data
fig, ax = plt.subplots(1, 3, figsize=(11,6))
sns.boxplot(data=energyConsumption_ds, y='Gas Consumption (tons)',  ax=ax[0])
ax[0].set_title("Distribution of Gas Consumption")

sns.boxplot(data=energyConsumption_ds, y='Electricity Consumption (MWh)', ax=ax[1])
ax[1].set_title("Distribution of Electricity Consumption")

sns.boxplot(data=energyConsumption_ds, y='Water Consumption (tons)', ax=ax[2])
ax[2].set_title("Distribution of Water Consumption")

plt.tight_layout()
plt.show()

Observations:

  • Gas consumption is slightly positively skewed but it is generally normally distributed apart from the outlier.
  • Generally, the electricity and water consumption seem to be normally distributed.
  • Gas consumption & Water consumption has 1 outlier each that do not really seem to affect the normal distribution.
In [11]:
# Split each column into two groups and compare whether the mean and variance of each group matches
for i in energyConsumption_ds.columns:
    X1 = energyConsumption_ds[i].iloc[:len(energyConsumption_ds[i])//2]
    X2 = energyConsumption_ds[i].iloc[len(energyConsumption_ds[i])//2:]
    print(f'For {i}, \ngroup 1 mean = {X1.mean():.2f}, group 2 mean = {X2.mean():.2f}')
    print(f'group 1 variance = {X1.var():.2f}, group 2 variance = {X2.var():.2f}')
    print()
For Gas Consumption (tons), 
group 1 mean = 24.28, group 2 mean = 23.29
group 1 variance = 24.48, group 2 variance = 23.23

For Electricity Consumption (MWh), 
group 1 mean = 774.04, group 2 mean = 1002.33
group 1 variance = 12208.96, group 2 variance = 9083.62

For Water Consumption (tons), 
group 1 mean = 426.87, group 2 mean = 542.75
group 1 variance = 13118.77, group 2 variance = 16079.41

Time Series Decomposition¶

  • Breaking down the time series into its underlying components to further understand the patterns, fluctuations and trend present
  • Some tests that I will perform:
    1. Seasonal Decomposition
    2. Test for Stationarity
      • ADF test
      • KPSS test
    3. Identification of orders (p,d,q),(P,D,Q,S)
      • Differencing
      • Autocorrelation Analysis
    4. Test for Causality
      • Granger's Causality Test
    5. Test for Cointegration
      • Engle Granger Cointegration Test
      • Johansen's Cointegration Test

Seasonal Decomposition¶

Time series is being represented by either a sum or a product of three components - (Trend, Seasonality & Random Component)

In [12]:
# Perform seasonal decomposition
for i in energyConsumption_ds.columns:
    print(f'Column: {i}')
    plt.rc("figure",figsize=(18,10))
    decomposition = seasonal_decompose(energyConsumption_ds[i])
    decomposition.plot()
    plt.show()
Column: Gas Consumption (tons)
Column: Electricity Consumption (MWh)
Column: Water Consumption (tons)

Observations:

  • There is a slight upward linear in the trend for Electricity Consumption but non-linear trend for Gas Consumption & Water Consumption
  • Seasonality seems to be occurring repeatedly similarly in a cyclic manner for all the features
  • Pattern repeats once every year ((1996-1992)/4 cyclic patterns) = 1 cycle per year
  • Residuals seem to be quite generally constant for Electricity Consumption
  • Residuals seem to show a sharp spike around 2014 and anomal decrease for other years for Gas Consumption
  • Residuals seem to show a very sharp decrease around 2014 for Water Consumption

Test for Stationarity (Augmented Dickey-Fuller Test)¶

A stationary series is one where the properties of mean, variance and covariance do not vary with time, whereas a non-stationary series has properites which vary with time. ADF test determines how strongly a time series is defined by a trend. It can be determined if the series is stationary or not.

  • $H_0$: The series is non-stationary/The series has a unit root.
  • $H_1$: The series is stationary/The series has no unit root.
In [13]:
#ADF test
def adf_test(time_series, significant_value=0.05, name="", verbose=False):
    unit_root=adfuller(time_series, autolag="BIC")
    test_statistic = round(unit_root[0],4)
    p_value = round(unit_root[1],4)
    n_lags = round(unit_root[2], 4)
    
    #Print summary
    print(f'ADF Test on {name}')
    print('-'*50)
    print(f'Significance level = {significant_value}')
    print(f'Test Statistic = {test_statistic}')
    print(f'No of Lags = {n_lags}')
    
    #Critical values for each percentage
    for key, val in unit_root[4].items():
        print(f"Critical Value ({key})= {round(val, 3)}")
    
    #Find out if p-value is smaller than or larger than the significance value
    if p_value<=significant_value:
        print(f'P-Value = {p_value} Null hypothesis is rejected.')
        print(f'The series is stationary.')
    else:
        print(f'P-Value = {p_value} Insufficient evidence to reject the null hypothesis.')
        print(f'The series is non-stationary.')
        
#Initiating functions       
for column in energyConsumption_ds.columns:
    adf_test(energyConsumption_ds[column], name=column)
    print()
ADF Test on Gas Consumption (tons)
--------------------------------------------------
Significance level = 0.05
Test Statistic = -6.5666
No of Lags = 1
Critical Value (1%)= -3.447
Critical Value (5%)= -2.869
Critical Value (10%)= -2.571
P-Value = 0.0 Null hypothesis is rejected.
The series is stationary.

ADF Test on Electricity Consumption (MWh)
--------------------------------------------------
Significance level = 0.05
Test Statistic = -2.0912
No of Lags = 12
Critical Value (1%)= -3.447
Critical Value (5%)= -2.869
Critical Value (10%)= -2.571
P-Value = 0.2481 Insufficient evidence to reject the null hypothesis.
The series is non-stationary.

ADF Test on Water Consumption (tons)
--------------------------------------------------
Significance level = 0.05
Test Statistic = -7.0564
No of Lags = 1
Critical Value (1%)= -3.447
Critical Value (5%)= -2.869
Critical Value (10%)= -2.571
P-Value = 0.0 Null hypothesis is rejected.
The series is stationary.

Observations:

  • The series for Gas Consumption & Water Consumption has a p-value of 0 which means that those time series data are stationary.
  • Test statistic of -6.5666 for Gas Consumption & test statistic of -7.0564 for Water Consumption are lower than all the critical values, which further shows that null hypothesis is rejected and both series are stationary.
  • The series for Electricity Consumption has a p-value of 0.248 which means that time series data is non-stationary.
  • Test statistic of -2.0912 for Electricity Consumption is greater than all the critical values, which further shows that there are insufficient evidence to reject the null hypothesis and the series is non-stationary.

Test for Stationarity (Kwiatkowski-Phillips-Schmidt-Shin Test)¶

KPSS test is a type of unit root test that tests for the stationarity of a given series around a deterministic trend.

  • $H_0$: The series is trend stationary or the series has no unit root.
  • $H_1$: The series is non-stationary or series has a unit root.
In [14]:
#KPSS test
warnings.filterwarnings('ignore')
def kpss_test(time_series, significant_value=0.05, name="", verbose=False):
    unit_root=kpss(time_series)
    test_statistic = round(unit_root[0],4)
    p_value = round(unit_root[1],4)
    n_lags = round(unit_root[2], 4)
    
    #Print summary
    print(f'KPSS Test on {name}')
    print('-'*50)
    print(f'Significance level = {significant_value}')
    print(f'Test Statistic = {test_statistic}')
    print(f'No of Lags = {n_lags}')
    
    #Critical values for each percentage
    for key, val in unit_root[3].items():
        print(f"Critical Value ({key})= {round(val, 3)}")
    
    #Find out if p-value is smaller than or larger than the significance value
    if p_value<=significant_value:
        print(f'P-Value = {p_value} Null hypothesis is rejected.')
        print(f'The series is non-stationary.')
    else:
        print(f'P-Value = {p_value} Insufficient evidence to reject the null hypothesis.')
        print(f'The series is trend stationary.')
        
#Initiating functions       
for column in energyConsumption_ds.columns:
    kpss_test(energyConsumption_ds[column], name=column)
    print()
KPSS Test on Gas Consumption (tons)
--------------------------------------------------
Significance level = 0.05
Test Statistic = 0.3402
No of Lags = 10
Critical Value (10%)= 0.347
Critical Value (5%)= 0.463
Critical Value (2.5%)= 0.574
Critical Value (1%)= 0.739
P-Value = 0.1 Insufficient evidence to reject the null hypothesis.
The series is trend stationary.

KPSS Test on Electricity Consumption (MWh)
--------------------------------------------------
Significance level = 0.05
Test Statistic = 3.5316
No of Lags = 10
Critical Value (10%)= 0.347
Critical Value (5%)= 0.463
Critical Value (2.5%)= 0.574
Critical Value (1%)= 0.739
P-Value = 0.01 Null hypothesis is rejected.
The series is non-stationary.

KPSS Test on Water Consumption (tons)
--------------------------------------------------
Significance level = 0.05
Test Statistic = 0.8388
No of Lags = 10
Critical Value (10%)= 0.347
Critical Value (5%)= 0.463
Critical Value (2.5%)= 0.574
Critical Value (1%)= 0.739
P-Value = 0.01 Null hypothesis is rejected.
The series is non-stationary.

Observations:

  • For Gas Consumption, test statistics is lower than their individual critical values which means that there is insufficient evidence to reject the null hypothesis, implying that the series are trend stationary.
  • For Electricity Consumption & Water Consumption, both test statistic are larger than the individual critical values which means that the null hypothesis is rejected and both time series are non-stationary.

Possible Outcomes to Truly Stationary Series¶

  • Case 1: Both tests conclude that the given series is stationary – The series is stationary
  • Case 2: Both tests conclude that the given series is non-stationary – The series is non-stationary
  • Case 3: ADF concludes non-stationary, and KPSS concludes stationary – The series is trend stationary. To make the series strictly stationary, the trend needs to be removed in this case. Then the detrended series is checked for stationarity.
  • Case 4: ADF concludes stationary, and KPSS concludes non-stationary – The series is difference stationary. Differencing is to be used to make series stationary. Then the differenced series is checked for stationarity

Stationarity Observations¶

  • For Gas Consumption, ADF test & KPSS test shows that the the series is stationary. Hence, the series is stationary.
  • For Electricity Consumption, ADF test & KPSS shows that the series are non-stationary. Hence, the series is non-stationary.
  • For Water Consumption, ADF test shows that the series is stationary while KPSS test shows that the series is non-stationary. Hence, the series is difference stationary.

Attempted Transformation (Box-Cox)¶

  • Make data more Gaussian-like, requires input to be strictly positive.
  • Stabilizing variance and minimizing skewness is estimated through maximum likelihood.
In [15]:
# def transform_box_cox(data):
#     transformer=PowerTransformer(method='box-cox')
#     transformed_data = transformer.fit_transform(data)
#     return transformed_data
In [16]:
# columnsToBeTransformed = [["Gas Consumption (tons)", "Electricity Consumption (MWh)", "Water Consumption (tons)"]]

# for i in columnsToBeTransformed:
#     energyConsumption_ds[i]=transform_box_cox(energyConsumption_ds[i])
    
# display(energyConsumption_ds.head())

Identification of Orders (p,d,q)¶

  • Before we identify the orders for the time series models, we have to ensure that the time series is stationary first by applying differencing on the non-stationary time series.
  • Next, we will use the ACF & PACF plot to determine the orders to be used for ARIMA & SARIMA.

Differencing (1st Round)¶

  • In this case, we need to make sure that Electricity Consumption is stationary first.
  • Water Consumption has to be differenced first and check whether it is stationary, since it is difference stationary.
  • First-Order Differencing: $$ y'_t=y_t-y_{t-1} $$
  • $y_t$: value with time

Water Consumption¶

In [17]:
#Perform differencing on Water Consumption (order=1)
water_diff=diff(energyConsumption_ds["Water Consumption (tons)"], k_diff=1)
adf_result1=adfuller(energyConsumption_ds["Water Consumption (tons)"], autolag="BIC")
adf_result2=adfuller(water_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Water Consumption (tons)"])
kpss_result2=kpss(water_diff)

print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Water Consumption (tons)"].plot(ax=ax[0])
ax[0].grid(True)
water_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Water Consumption (tons)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.000000
ADF-test (p-value after differencing): 0.000000
KPSS-test (p-value before differencing): 0.010000
KPSS-test (p-value after differencing): 0.100000

Observations:

  • ADF test shows a p-value smaller than the significance value of 0.05, null hypothesis is rejected.
  • KPSS test shows a p-value larger than the significance value of 0.05, little evidence to show that null hypothesis is rejected.
  • This show that for Water Consumption, the series differenced order of 1 is stationary.

Electricity Consumption¶

In [18]:
#Perform differencing on Electricity Consumption (order=1)
electricity_diff=diff(energyConsumption_ds["Electricity Consumption (MWh)"], k_diff=1)
adf_result1=adfuller(energyConsumption_ds["Electricity Consumption (MWh)"], autolag="BIC")
adf_result2=adfuller(electricity_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Electricity Consumption (MWh)"])
kpss_result2=kpss(electricity_diff)


print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Electricity Consumption (MWh)"].plot(ax=ax[0])
ax[0].grid(True)
electricity_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Electricity Consumption (kWh)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.248052
ADF-test (p-value after differencing): 0.000000
KPSS-test (p-value before differencing): 0.010000
KPSS-test (p-value after differencing): 0.100000

Observations:

  • ADF test shows a p-value smaller than the significance value of 0.05, null hypothesis is rejected.
  • KPSS test shows a p-value larger than the significance value of 0.05, little evidence to show that null hypothesis is rejected.
  • This show that for Electricity Consumption, the series differenced order of 1 is stationary.
In [19]:
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Water Consumption (tons)"]=water_diff
energyConsumption_ds["Electricity Consumption (MWh)"]=electricity_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 18.0 0.0 0.0
1990-02-01 15.8 -18.4 91.9
1990-03-01 17.3 -82.2 -129.6
1990-04-01 18.9 -49.8 4.2
1990-05-01 22.0 -21.5 -26.9

Autocorrelation Analysis¶

Autocorrelation is the correlation of a time series and its lagged version over time.

  • ACF: Measures the degree of similarity between a given time series and the lagged version of that time series at the various intervals observed
  • PACF: Measures the degree of similarity between a given time series and the lagged version of that time series at the various intervals observed in which only the direct effect has been shown, and all other intermediary effects are removed from the given time series

AR & MA Process (ARIMA)¶

  • ARIMA will be one of the models commonly used for time series, made up of Auto Regressive (AR), Integrated (I) & Moving Average (MA) process
  • AR(p) process: Current values depend on its own p-previous values
  • MA (q) process: The current deviation from mean depends on q-previous deviations

How to identify orders of p and q from ACF & PACF¶

  • Model is AR if the ACF trails off after a lag and has a hard cut-off in the PACF after a lag. This lag is taken as the value for p.
  • Model is MA if the PACF trails off after a lag and has a hard cut-off in the ACF after a lag. This lag is taken as the value for q.
  • Model is a mix of AR & MA if both the ACF & PACF trail off.
In [20]:
#Plot acf and pacf graphs
for i in energyConsumption_ds.columns:
    fig,ax=plt.subplots(1,2, tight_layout=True)
    plot_acf(energyConsumption_ds[i], ax=ax[0])
    plot_pacf(energyConsumption_ds[i], ax=ax[1])
    fig.suptitle(i, fontsize=14, fontweight="bold")
    plt.show()

Observations:

  • For Gas Consumption, even though there is a tailoff in the ACF plot, it has a relatively high number of lags (>10) hence an order of differencing is needed.
  • For Electricity Consumption, there is a sine-like curve which shows a noticeable seasonal pattern after the first order of differencing, hence seasonal differencing is needed. A significant lag spike is noticed to happen every 6 lags and 12 lags.
  • For Water Consumption, there is a significant lag spike at lag 1 in ACF plot & there is a gradual decrease in the PACF plot, so we shall use Moving Average model with q parameter of 1.

Differencing (2nd Round)¶

  • Now, based on the observations from the ACF and PACF, we shall apply the changes accordingly.
  • I will apply the first order differencing on Gas Consumption now.
  • Electricity Consumption will require a seasonal differencing.

Gas Consumption¶

In [21]:
#Perform differencing on Gas Consumption (order=1)
gas_diff=diff(energyConsumption_ds["Gas Consumption (tons)"], k_diff=1)
adf_result1=adfuller(energyConsumption_ds["Gas Consumption (tons)"], autolag="BIC")
adf_result2=adfuller(gas_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Gas Consumption (tons)"])
kpss_result2=kpss(gas_diff)


print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Gas Consumption (tons)"].plot(ax=ax[0])
ax[0].grid(True)
gas_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Gas Consumption (tons)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.000000
ADF-test (p-value after differencing): 0.000000
KPSS-test (p-value before differencing): 0.100000
KPSS-test (p-value after differencing): 0.100000
In [22]:
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Gas Consumption (tons)"]=gas_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 0.0 0.0 0.0
1990-02-01 -2.2 -18.4 91.9
1990-03-01 1.5 -82.2 -129.6
1990-04-01 1.6 -49.8 4.2
1990-05-01 3.1 -21.5 -26.9
In [23]:
#Plot acf & pacf plot for Gas Consumption
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds["Gas Consumption (tons)"], ax=ax[0])
plot_pacf(energyConsumption_ds["Gas Consumption (tons)"], ax=ax[1])
fig.suptitle("Gas Consumption (tons)", fontsize=14, fontweight="bold")
plt.show()

Observations:

  • For Gas Consumption, there is a significant lag spike at lag 1 in ACF plot & there is a gradual decrease in the PACF plot, so we shall use Moving Average model with q parameter of 1.

Electricity Consumption¶

In [24]:
#Re-assign the original values for Electricity Consumption
energyConsumption_ds["Electricity Consumption (MWh)"]=energyConsumption_df["Electricity Consumption (MWh)"].values
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 0.0 725.1 0.0
1990-02-01 -2.2 706.7 91.9
1990-03-01 1.5 624.5 -129.6
1990-04-01 1.6 574.7 4.2
1990-05-01 3.1 553.2 -26.9

Electricity Consumption (Periodogram)¶

  • Used to reveal underlying periodic components present in a time series dataset.
  • Helps identify seasonal patterns in time series data
  • Identify the most dominant frequency
In [25]:
#Plot Periodogram
freq, power = periodogram(energyConsumption_df["Electricity Consumption (MWh)"])
plt.figure(figsize=(8, 6))

period = 1/freq
plt.plot(period, power, color='indianred')

plt.xlim(0, 15)
plt.xticks(np.arange(0, 15, 2))
plt.xlabel('Period')
plt.ylabel('Power Spectral Density')
plt.title('Periodogram of Electricity Consumption (Mwh)')
plt.show()

Observations:

  • Even though the periodogram clearly shows that 6th month is the dominant frequency, 12th month also has a potential frequency, so these 2 can be the potential seasonality parameters for seasonality component.
  • We shall look at the acf & pacf plot to decide whether to use S=6 or S=12 as the seasonality.
In [26]:
#Re-assign the original values for Electricity Consumption
energyConsumption_ds["Electricity Consumption (MWh)"]=energyConsumption_df["Electricity Consumption (MWh)"].values
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 0.0 725.1 0.0
1990-02-01 -2.2 706.7 91.9
1990-03-01 1.5 624.5 -129.6
1990-04-01 1.6 574.7 4.2
1990-05-01 3.1 553.2 -26.9
In [27]:
#Calculate the seasonal period with the lowest standard deviation
S = np.arange(1, 13)
season_std = []
for season in S:
    data = energyConsumption_ds["Electricity Consumption (MWh)"]
    seasonality = data.diff(periods=season).fillna(0)
    std = seasonality.std()
    season_std.append(std)

#Plot a graph (Seasonal Period Vs Standard Deviation)
plt.figure(figsize=(8,6))
ax=plt.axes()
ax=sns.lineplot(x=S, y=season_std, marker="o", color="#316FBE")
#Get the lowest standard deviation
lowest_std = min(season_std)
plt.axhline(y=lowest_std, color="#b58900", linestyle="--")
plt.yticks(np.arange(30, 151, 10))

plt.xlabel("Seasonal Period")
plt.ylabel("Standard Deviation")
plt.title("Seasonal Period Vs Standard Deviation", fontweight="bold", fontsize=14)
plt.show()

Observations:

  • According to an article posted by the Duke University, the optimal order of differencing for time series is often the order of differencing at which the standard deviation is lowest.
  • Hence, seasonal period of 12 is the most optimal seasonal period.

Electricity Consumption (S=6)¶

  • Looking at the acf plot and pacf plot when seasonal component=6
In [28]:
#Perform differencing on Electricity Consumption (order=1)
electricity_diff=diff(energyConsumption_ds["Electricity Consumption (MWh)"], 
                      k_diff=1, k_seasonal_diff=1, seasonal_periods=6)
adf_result1=adfuller(energyConsumption_ds["Electricity Consumption (MWh)"], autolag="BIC")
adf_result2=adfuller(electricity_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Electricity Consumption (MWh)"])
kpss_result2=kpss(electricity_diff)


print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Electricity Consumption (MWh)"].plot(ax=ax[0])
ax[0].grid(True)
electricity_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Electricity Consumption (kWh)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.248052
ADF-test (p-value after differencing): 0.000000
KPSS-test (p-value before differencing): 0.010000
KPSS-test (p-value after differencing): 0.100000
In [29]:
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Electricity Consumption (MWh)"]=electricity_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 0.0 0.0 0.0
1990-02-01 -2.2 0.0 91.9
1990-03-01 1.5 0.0 -129.6
1990-04-01 1.6 0.0 4.2
1990-05-01 3.1 0.0 -26.9
In [30]:
#Plot acf & pacf for Electricity Consumption
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[0])
plot_pacf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[1])
fig.suptitle("Electricity Consumption (MWh)", fontsize=14, fontweight="bold")
plt.show()

Observations:

  • For S=6, we can see that there is still an underlying seasonality (sin-curve) that can be seen in the acf plot and there is an increasing threshold which suggest that an additional order of differencing may be required.
  • Furthermore, in the case of strong seasonality, as observed in our time series, D=1. A rule of thumb is that d + D should not be greater than 2.
  • So, we have to look at another seasonal period.
In [31]:
#Re-assign the original values for Electricity Consumption
energyConsumption_ds["Electricity Consumption (MWh)"]=energyConsumption_df["Electricity Consumption (MWh)"].values
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 0.0 725.1 0.0
1990-02-01 -2.2 706.7 91.9
1990-03-01 1.5 624.5 -129.6
1990-04-01 1.6 574.7 4.2
1990-05-01 3.1 553.2 -26.9

Electricity Consumption (S=12)¶

  • Looking at the acf plot and pacf plot when seasonal component=12
In [32]:
#Perform differencing on Electricity Consumption (order=1)
electricity_diff=diff(energyConsumption_ds["Electricity Consumption (MWh)"], 
                      k_diff=1, k_seasonal_diff=1, seasonal_periods=12)
adf_result1=adfuller(energyConsumption_ds["Electricity Consumption (MWh)"], autolag="BIC")
adf_result2=adfuller(electricity_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Electricity Consumption (MWh)"])
kpss_result2=kpss(electricity_diff)


print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Electricity Consumption (MWh)"].plot(ax=ax[0])
ax[0].grid(True)
electricity_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Electricity Consumption (kWh)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.248052
ADF-test (p-value after differencing): 0.000000
KPSS-test (p-value before differencing): 0.010000
KPSS-test (p-value after differencing): 0.100000
In [33]:
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Electricity Consumption (MWh)"]=electricity_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 0.0 0.0 0.0
1990-02-01 -2.2 0.0 91.9
1990-03-01 1.5 0.0 -129.6
1990-04-01 1.6 0.0 4.2
1990-05-01 3.1 0.0 -26.9
In [34]:
#Plot acf & pacf for Electricity Consumption
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[0])
plot_pacf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[1])
fig.suptitle("Electricity Consumption (MWh)", fontsize=14, fontweight="bold")
plt.show()

Observations:

  • For Electricity Consumption, there is a significant lag spike at lag 2 in PACF plot & a significant lag at lag 3 in the ACF plot before the threshold level cuts it off, so we shall use a mix of Auto Regressive & Moving Average model with p parameter of 2 and a q parameter of 3.
  • According to the Duke University article, if the autocorrelation of the appropriately differenced series is positive at lag s, where s is the number of periods in a season, then consider adding an SAR term to the model. If the autocorrelation of the differenced series is negative at lag s, consider adding an SMA term to the model.
  • You should try to avoid using more than one or two seasonal parameters (SAR+SMA) in the same model, as this is likely to lead to overfitting of the data and/or problems in estimation. Hence, I will consider P=0, Q=1.

Final ACF & PACF plot¶

In [35]:
#Plot acf and pacf graphs
for i in energyConsumption_ds.columns:
    fig,ax=plt.subplots(1,2, tight_layout=True)
    plot_acf(energyConsumption_ds[i], ax=ax[0])
    plot_pacf(energyConsumption_ds[i], ax=ax[1])
    fig.suptitle(i, fontsize=14, fontweight="bold")
    plt.show()

Observations:

  • For Gas Consumption, there is a significant lag spike at lag 1 in ACF plot & there is a gradual decrease in the PACF plot, so we shall use Moving Average model with q parameter of 1.
  • For Electricity Consumption, there is a significant lag spike at lag 2 in PACF plot & a significant lag at lag 3 in the ACF plot before the threshold level cuts it off, so we shall use a mix of Auto Regressive & Moving Average model with p parameter of 2 and a q parameter of 3.
  • You should try to avoid using more than one or two seasonal parameters (SAR+SMA) in the same model, as this is likely to lead to overfitting of the data and/or problems in estimation. Hence, I will consider P=0, Q=1.
  • For Water Consumption, there is a significant lag spike at lag 1 in ACF plot & there is a gradual decrease in the PACF plot, so we shall use Moving Average model with q parameter of 1.

Order of (p,d,q) and (P,D,Q,S)

  • Gas Consumption = (0,1,1)
  • Electricity Consumption = (2,1,3),(0,1,1,12)
  • Water Consumption = (0,1,1)

Test for Causality (Grangers Causality Test)¶

Grangers Causality Test determines whether one time series is useful in forecasting another. Focuses on a rather short term causality and we can consider a Multivariate Model for time series

  • $H_0$: X does not cause the time series in Y.
  • $H_1$: X cause the time series in Y.
In [36]:
#Function for Grangers Causality Test
def grangers_causality_matrix(data, variables, maxlag=13, test = 'ssr_chi2test', verbose=False):

    dataset = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)

    for c in dataset.columns:
        for r in dataset.index:
            test_result = grangercausalitytests(data[[r,c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: 
                print(f'Y = {r}, X = {c}, P Values = {p_values}')

            min_p_value = np.min(p_values)
            dataset.loc[r,c] = min_p_value

    dataset.columns = [var + '_x' for var in variables]
    dataset.index = [var + '_y' for var in variables]
    return dataset
In [37]:
granger_df = grangers_causality_matrix(energyConsumption_ds, variables = energyConsumption_ds.columns)

#Print out significant pairs that show causality
pairs = granger_df.unstack().sort_values(kind="quicksort")
mask = pairs < 0.05
print("Significant Pairs:")
print(pairs[mask])

plt.figure(figsize=(8,6))
sns.heatmap(granger_df, annot=True, cmap='seismic', vmin=0)
plt.title("Grangers Causality Test")
plt.show()
Significant Pairs:
Gas Consumption (tons)_x  Water Consumption (tons)_y    0.0052
dtype: float64

Observations:

  • If p-value is less than 0.05, we would reject null hypothesis and it means X cause the time series in Y.
  • Gas Consumption causes the time series in Water Consumption.

Test for Cointegration (Engle Granger Cointegration Test)¶

Cointegration refers to long-term equilibrium relationship between non-stationary variables, means they move together in the long run despite having individual trends.

  • $H_0$: There is no cointegration.
  • $H_1$: There is a cointegrating relationship.

Limitation: Assumes that the cointegrating relationship is linear and there is only one linear relationship among the variables is being tested.

In [38]:
#Function for Engle-Granger two-step cointegration test
def EngleGranger_coint_test(dataframe, critical_level = 0.05):
    n = dataframe.shape[1] 
    pvalue_matrix = np.ones((n, n)) 
    keys = dataframe.columns 
    pairs = [] 
    for i in range(n):
        for j in range(i+1, n): 
            series1 = dataframe[keys[i]] 
            series2 = dataframe[keys[j]]
            result = coint(series1, series2) 
            pvalue = result[1] 
            pvalue_matrix[i, j] = pvalue
            if pvalue < critical_level: 
                pairs.append((keys[i], keys[j], pvalue))
    return pvalue_matrix, pairs
In [39]:
#Create dataframe to store the matrix
pvalue_matrix, pairs = EngleGranger_coint_test(energyConsumption_ds)
coint_pvalue_matrix_df = pd.DataFrame(pvalue_matrix)

for pair in pairs:
    print(pair)

plt.figure(figsize=(8,6))
sns.heatmap(coint_pvalue_matrix_df, xticklabels=energyConsumption_ds.columns,
                   yticklabels=energyConsumption_ds.columns, annot=True, cmap='seismic')
plt.title('Engle-Granger Cointegration Test')
plt.show()
('Gas Consumption (tons)', 'Electricity Consumption (MWh)', 7.310878998998921e-11)
('Gas Consumption (tons)', 'Water Consumption (tons)', 1.2408872926483798e-16)
('Electricity Consumption (MWh)', 'Water Consumption (tons)', 1.356676425276096e-11)

Observations:

  • It looks like in the long run, every feature has a cointegrating relationship with all the other features as they have p-values of less than 0.05 which means null hypothesis is rejected.

Test for Cointegration (Johansen's Cointegration Test)¶

An extension of Engle-Granger two-step cointegration test and can handle cases when there are more than two stationary variables involved.

  • $H_0$: There is no cointegration relationship among the variables.
  • $H_1$: There are one or more cointegrating relationships among the variables.
In [40]:
coint_johansen = coint_johansen(energyConsumption_ds, 1, 1)

critical_vals = {"0.90": 0, "0.95": 1, "0.99": 2}
#Trace statistic
trace_stat = coint_johansen.lr1
cvt = coint_johansen.cvt[:, critical_vals[str(1 - 0.05)]]

# Summary
print("Column  : Test Statistic > CI(95%)   =>   Significant")
for col, trace, ci in zip(energyConsumption_ds.columns, trace_stat, cvt):
    print(col, ":", round(trace, 2), ">", ci, " => ", trace > ci)
Column  : Test Statistic > CI(95%)   =>   Significant
Gas Consumption (tons) : 813.96 > 35.0116  =>  True
Electricity Consumption (MWh) : 510.52 > 18.3985  =>  True
Water Consumption (tons) : 230.69 > 3.8415  =>  True

Observations:

  • Gas Consumption, Electricity Consumption, Water Consumption have a significant cointegration which suggests that there is a correlation between the 3 values in the long run.

2) Data Cleaning/Feature Engineering¶

  • Now, I will perform the same data cleaning on the orignal dataset, energyConsumption_df.
  • Splitting the dataset into training and testing data

Setting Index¶

In [4]:
#Convert date to date type by formatting in the appropriate Date format
energyConsumption_df["DATE"] = pd.to_datetime(energyConsumption_df["DATE"], format='%d/%m/%Y')
energyConsumption_df.set_index('DATE', inplace=True)
display(energyConsumption_df.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 18.0 725.1 548.8
1990-02-01 15.8 706.7 640.7
1990-03-01 17.3 624.5 511.1
1990-04-01 18.9 574.7 515.3
1990-05-01 22.0 553.2 488.4

Splitting Train & Test Dataset¶

In [5]:
#Reason why this date was chosen to split: 80% data before this date & 20% data during and after this date
split_date = '2016-06-01'
train_set = energyConsumption_df[energyConsumption_df.index < split_date]
test_set = energyConsumption_df[energyConsumption_df.index >= split_date]
print(f"train.shape = {train_set.shape}, test.shape = {test_set.shape}")
print(type(train_set.index))
train.shape = (317, 3), test.shape = (80, 3)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
In [43]:
#Visualise train and test time series in a time series plot
fig = plt.figure(figsize=(18,13), tight_layout=True)
for i, column in enumerate(energyConsumption_ds.columns):
    ax = fig.add_subplot(len(energyConsumption_ds.columns), 1, i+1)
    train_set[column].plot(ax=ax, color="dodgerblue")
    test_set[column].plot(ax=ax, color="green")
    ax.set_title(f"Time Series Visualisation of {column}", fontsize=16)
    ax.grid(True)
plt.show()

3) Model Selection¶

The following Time-Series Models are used:

1. Auto ARIMA (part of ARIMA)¶

  • Automated version of the ARIMA
  • Automatically searches for the best combination of ARIMA hyperparameters (p,d,q) using algorithms like grid search or stepwise search
  • Evaluate different parameter combinations using BIC to choose the best fitting model
  • Quickly identify suitable ARIMA models for different time series without manual parameter tuning

2. ARIMA (Auto Regressive Integrated Moving Average Model)¶

  • Time series forecasting model that combines autoregressive (AR) and moving average (MA) components
  • (AR) component: Models the dependency of the current value on its past values (lags).
  • (MA) component: Models the dependency of the current value on past forecast errors (residuals).
  • Uses (p,d,q) where p is order of AR component, d is the degree of differencing, q is the order of the MA component

3. SARIMA (Seasonal Auto Regressive Integrated Moving Average Model)¶

  • Extends the basic ARIMA model to include seasonality patterns
  • Introduces seasonal autoregressive (SAR) and seasonal moving average (SMA) components to capture the periodic patterns in the data
  • Uses (p,d,q),(P,D,Q,s) where the lowercase letters correspond to the non-seasonal components, uppercase letters correspond to the seasonal components and s represents the seasonality period

4. VARMA (Vector Auto Regressive Moving Average Model)¶

  • Multivariate extension of the ARIMA model designed to handle multiple time series variables that can influence each other
  • Each variable is regressed on its own past values as well as the past values of other variables (AR)
  • Model uses lagged forecast errors (residuals) of the same variable and other variables to predict future values (MA)
  • Uses (p,q) where p is the order of the autoregressive component and q is the order of the moving average component
  • Useful when several related time series variables interact with one another

4) Model Evaluation¶

A few methods I used to evaluate how each model would handle each column:

  • Walk-Forward Validation
    • A special technique of Cross Validation.
    • Process involving dividing time series data into training & testing set and updating the model at each step, incorporating the latest data and making predictions for the next time step.
    • Assess the model ability to adapt and changing patterns in time series. Walk Forward Validation (Source: Miro, Medium) Image Source: Miro Medium
    • Use graphical visualization to see how accurate predictions are to actual data

  • Summaries of Models
    • $P>|z|$ column: shows whether each coefficient is significant or not
      • $H_0$: Each coefficient is not statistically significant.
      • $H_1$: Each coefficient is statistically significant.
    • Ljung Box Test: checks if an autocorrelation exists in a time series
      • $H_0$: The residuals are independently distributed. (white noise)
      • $H_1$: The residuals are not independently distributed. (not white distributed)
    • Heteroscedasticity Test: checks if error residuals are homoscedastic or have the same variance
      • $H_0$: The residuals show variance.
      • $H_1$: The residuals do not show variance.
    • Jarque-Bera Test: checks for normality of errors
      • $H_0$: The data is normally distributed against an alternative of another distribution.
      • $H_1$: The data is not normally distributed against an alternative of another distribution.
    • Log-Likelihood: identifies a distribution that fits best with the sampled data
    • AIC: determines the strength of the linear regression model. The AIC penalizes a model for adding parameters since adding more parameters will always increase the maximum likelihood value.
    • BIC: punishes a model for complexity, but it also incorporates the number of rows in the data. (More Info Below)
    • HQIC: another criterion for model selection, not commonly used in practice
  • Diagnostics Plot
    • Made up of standardized residuals plot, histogram with kde plot, Normal Q-Q plot and Correlogram
  • Metrics:
    • MAPE (Mean Absolute Percentage Error)
      • Measures the percentage difference between the predicted values and the actual values in the time series $$ MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \times 100 $$
    • RMSE (Root Mean Square Error)
      • Measures the average magnitude of the errors between forecasted values and actual values in the time series $$ RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}} $$
    • SMAPE (Symmetric Mean Absolute Percentage Error)
      • Measures the absolute percentage error for each individual forecasted value and take the mean of these absolute percentage errors across all time steps in the test set
      • Does not distinguish between over-predictions and under-predictions and give equal weight to both $$ SMAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2} $$

Auto ARIMA¶

I will be using this metrics to see which is the best order of parameters. I will just check the summary to see how well the ARIMA model do with the order Auto ARIMA deem the best.

BIC (Bayesian Information Criterian)¶

  • Entails a calculation of maximum log-likelihood and a higher penalty term for a higher number of parameters $$ BIC_i = -2logL_i+p_ilogn $$

  • $L$: Max likelihood - parameter with the highest probability of correctly representing the relationship between input & output

  • $p$: Number of parameters
  • $n$: Number of values in the dataset
  • The lower the value is, the better the model is

Gas Consumption (Auto ARIMA)¶

In [44]:
#Stepwise set to True to see the BIC for every parameter searched
arima_modelGas=auto_arima(train_set["Gas Consumption (tons)"], start_p=1, start_q=1,
                       test='adf', max_p=5, max_q=5, m=1, information_criterion='bic',          
                       seasonal=False, start_P=0, D=None, trace=True,
                       error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelGas)
Performing stepwise search to minimize bic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : BIC=1726.744, Time=0.18 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : BIC=1794.151, Time=0.05 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : BIC=1763.273, Time=0.06 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : BIC=1745.416, Time=0.07 sec
 ARIMA(0,1,0)(0,0,0)[0]             : BIC=1788.396, Time=0.03 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : BIC=1730.887, Time=0.43 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : BIC=1731.259, Time=0.21 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : BIC=1736.710, Time=0.13 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : BIC=1762.028, Time=0.10 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : BIC=1735.755, Time=0.24 sec
 ARIMA(1,1,1)(0,0,0)[0]             : BIC=1720.989, Time=0.07 sec
 ARIMA(0,1,1)(0,0,0)[0]             : BIC=1739.661, Time=0.04 sec
 ARIMA(1,1,0)(0,0,0)[0]             : BIC=1757.519, Time=0.04 sec
 ARIMA(2,1,1)(0,0,0)[0]             : BIC=1725.131, Time=0.22 sec
 ARIMA(1,1,2)(0,0,0)[0]             : BIC=1725.504, Time=0.09 sec
 ARIMA(0,1,2)(0,0,0)[0]             : BIC=1730.958, Time=0.05 sec
 ARIMA(2,1,0)(0,0,0)[0]             : BIC=1756.274, Time=0.04 sec
 ARIMA(2,1,2)(0,0,0)[0]             : BIC=1729.999, Time=0.12 sec

Best model:  ARIMA(1,1,1)(0,0,0)[0]          
Total fit time: 2.192 seconds
 ARIMA(1,1,1)(0,0,0)[0]          
In [45]:
arima_modelGas.summary()
Out[45]:
SARIMAX Results
Dep. Variable: y No. Observations: 317
Model: SARIMAX(1, 1, 1) Log Likelihood -851.861
Date: Thu, 10 Aug 2023 AIC 1709.721
Time: 07:26:44 BIC 1720.989
Sample: 01-01-1990 HQIC 1714.223
- 05-01-2016
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.4200 0.043 9.873 0.000 0.337 0.503
ma.L1 -0.9052 0.036 -24.899 0.000 -0.976 -0.834
sigma2 12.8153 0.503 25.479 0.000 11.829 13.801
Ljung-Box (L1) (Q): 0.27 Jarque-Bera (JB): 2007.30
Prob(Q): 0.60 Prob(JB): 0.00
Heteroskedasticity (H): 2.28 Skew: 0.68
Prob(H) (two-sided): 0.00 Kurtosis: 15.27


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Electricity Consumption (Auto ARIMA)¶

In [46]:
#Stepwise set to True to see the BIC for every parameter searched
#Does not account for seasonality
arima_modelElectricity=auto_arima(train_set["Electricity Consumption (MWh)"], start_p=1, start_q=1,
                       test='adf', max_p=5, max_q=5, information_criterion='bic',          
                       seasonal=False, start_P=0, D=None, trace=True,
                       error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelElectricity)
Performing stepwise search to minimize bic
 ARIMA(1,0,1)(0,0,0)[0]             : BIC=3533.438, Time=0.08 sec
 ARIMA(0,0,0)(0,0,0)[0]             : BIC=5193.516, Time=0.03 sec
 ARIMA(1,0,0)(0,0,0)[0]             : BIC=inf, Time=0.03 sec
 ARIMA(0,0,1)(0,0,0)[0]             : BIC=inf, Time=0.08 sec
 ARIMA(2,0,1)(0,0,0)[0]             : BIC=3538.824, Time=0.16 sec
 ARIMA(1,0,2)(0,0,0)[0]             : BIC=3470.074, Time=0.16 sec
 ARIMA(0,0,2)(0,0,0)[0]             : BIC=inf, Time=0.17 sec
 ARIMA(2,0,2)(0,0,0)[0]             : BIC=3447.339, Time=0.29 sec
 ARIMA(3,0,2)(0,0,0)[0]             : BIC=inf, Time=0.39 sec
 ARIMA(2,0,3)(0,0,0)[0]             : BIC=3421.661, Time=0.29 sec
 ARIMA(1,0,3)(0,0,0)[0]             : BIC=3417.635, Time=0.25 sec
 ARIMA(0,0,3)(0,0,0)[0]             : BIC=inf, Time=0.38 sec
 ARIMA(1,0,4)(0,0,0)[0]             : BIC=3404.362, Time=0.34 sec
 ARIMA(0,0,4)(0,0,0)[0]             : BIC=inf, Time=0.50 sec
 ARIMA(2,0,4)(0,0,0)[0]             : BIC=3382.828, Time=0.55 sec
 ARIMA(3,0,4)(0,0,0)[0]             : BIC=3426.799, Time=0.67 sec
 ARIMA(2,0,5)(0,0,0)[0]             : BIC=3342.279, Time=0.51 sec
 ARIMA(1,0,5)(0,0,0)[0]             : BIC=3350.089, Time=0.46 sec
 ARIMA(3,0,5)(0,0,0)[0]             : BIC=inf, Time=0.79 sec
 ARIMA(2,0,5)(0,0,0)[0] intercept   : BIC=3456.562, Time=0.71 sec

Best model:  ARIMA(2,0,5)(0,0,0)[0]          
Total fit time: 6.840 seconds
 ARIMA(2,0,5)(0,0,0)[0]          
In [47]:
#Stepwise set to True to see the BIC for every parameter searched
#Account for seasonality
arima_modelElectricity=auto_arima(train_set["Electricity Consumption (MWh)"], start_p=1, start_q=1,
                       test='adf', max_p=5, max_q=5, m=12, d=1, max_P=3,max_Q=3, information_criterion='bic',          
                       seasonal=True, start_P=0, D=None, trace=True,
                       error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelElectricity)
Performing stepwise search to minimize bic
 ARIMA(1,1,1)(0,1,1)[12]             : BIC=2772.885, Time=0.75 sec
 ARIMA(0,1,0)(0,1,0)[12]             : BIC=2938.015, Time=0.03 sec
 ARIMA(1,1,0)(1,1,0)[12]             : BIC=2883.919, Time=0.13 sec
 ARIMA(0,1,1)(0,1,1)[12]             : BIC=2809.733, Time=0.31 sec
 ARIMA(1,1,1)(0,1,0)[12]             : BIC=inf, Time=0.20 sec
 ARIMA(1,1,1)(1,1,1)[12]             : BIC=2778.554, Time=0.96 sec
 ARIMA(1,1,1)(0,1,2)[12]             : BIC=2778.528, Time=1.64 sec
 ARIMA(1,1,1)(1,1,0)[12]             : BIC=2827.936, Time=0.46 sec
 ARIMA(1,1,1)(1,1,2)[12]             : BIC=2782.610, Time=2.83 sec
 ARIMA(1,1,0)(0,1,1)[12]             : BIC=2826.664, Time=0.28 sec
 ARIMA(2,1,1)(0,1,1)[12]             : BIC=2776.543, Time=0.68 sec
 ARIMA(1,1,2)(0,1,1)[12]             : BIC=2775.498, Time=0.75 sec
 ARIMA(0,1,0)(0,1,1)[12]             : BIC=2831.958, Time=0.17 sec
 ARIMA(0,1,2)(0,1,1)[12]             : BIC=2773.425, Time=0.50 sec
 ARIMA(2,1,0)(0,1,1)[12]             : BIC=2805.775, Time=0.38 sec
 ARIMA(2,1,2)(0,1,1)[12]             : BIC=2780.005, Time=1.05 sec
 ARIMA(1,1,1)(0,1,1)[12] intercept   : BIC=inf, Time=0.68 sec

Best model:  ARIMA(1,1,1)(0,1,1)[12]          
Total fit time: 11.845 seconds
 ARIMA(1,1,1)(0,1,1)[12]          
In [48]:
arima_modelElectricity.summary()
Out[48]:
SARIMAX Results
Dep. Variable: y No. Observations: 317
Model: SARIMAX(1, 1, 1)x(0, 1, 1, 12) Log Likelihood -1375.009
Date: Thu, 10 Aug 2023 AIC 2758.017
Time: 07:27:03 BIC 2772.885
Sample: 01-01-1990 HQIC 2763.965
- 05-01-2016
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.5059 0.058 8.718 0.000 0.392 0.620
ma.L1 -0.9540 0.020 -47.423 0.000 -0.993 -0.915
ma.S.L12 -0.7054 0.046 -15.283 0.000 -0.796 -0.615
sigma2 479.7243 30.285 15.840 0.000 420.366 539.083
Ljung-Box (L1) (Q): 0.33 Jarque-Bera (JB): 36.75
Prob(Q): 0.57 Prob(JB): 0.00
Heteroskedasticity (H): 2.55 Skew: -0.24
Prob(H) (two-sided): 0.00 Kurtosis: 4.64


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Water Consumption (Auto ARIMA)¶

In [49]:
#Stepwise set to True to see the BIC for every parameter searched
arima_modelWater=auto_arima(train_set["Water Consumption (tons)"], start_p=1, start_q=1,
                       test='adf', max_p=5, max_q=5, m=1, d=1, information_criterion='bic',          
                       seasonal=False, start_P=0, D=None, trace=True,
                       error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelWater)
Performing stepwise search to minimize bic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : BIC=3854.921, Time=0.17 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : BIC=3930.713, Time=0.03 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : BIC=3899.250, Time=0.06 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : BIC=3869.609, Time=0.11 sec
 ARIMA(0,1,0)(0,0,0)[0]             : BIC=3924.959, Time=0.03 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : BIC=3859.638, Time=0.22 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : BIC=3859.513, Time=0.21 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : BIC=3861.051, Time=0.16 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : BIC=3890.107, Time=0.06 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : BIC=3865.002, Time=0.39 sec
 ARIMA(1,1,1)(0,0,0)[0]             : BIC=3849.228, Time=0.08 sec
 ARIMA(0,1,1)(0,0,0)[0]             : BIC=3863.853, Time=0.05 sec
 ARIMA(1,1,0)(0,0,0)[0]             : BIC=3893.496, Time=0.05 sec
 ARIMA(2,1,1)(0,0,0)[0]             : BIC=3853.953, Time=0.12 sec
 ARIMA(1,1,2)(0,0,0)[0]             : BIC=3853.831, Time=0.13 sec
 ARIMA(0,1,2)(0,0,0)[0]             : BIC=3855.314, Time=0.07 sec
 ARIMA(2,1,0)(0,0,0)[0]             : BIC=3884.352, Time=0.04 sec
 ARIMA(2,1,2)(0,0,0)[0]             : BIC=3859.325, Time=0.22 sec

Best model:  ARIMA(1,1,1)(0,0,0)[0]          
Total fit time: 2.191 seconds
 ARIMA(1,1,1)(0,0,0)[0]          
In [50]:
arima_modelWater.summary()
Out[50]:
SARIMAX Results
Dep. Variable: y No. Observations: 317
Model: SARIMAX(1, 1, 1) Log Likelihood -1915.981
Date: Thu, 10 Aug 2023 AIC 3837.961
Time: 07:27:05 BIC 3849.228
Sample: 01-01-1990 HQIC 3842.462
- 05-01-2016
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.3952 0.050 7.891 0.000 0.297 0.493
ma.L1 -0.9167 0.026 -34.879 0.000 -0.968 -0.865
sigma2 1.078e+04 590.046 18.265 0.000 9620.656 1.19e+04
Ljung-Box (L1) (Q): 0.16 Jarque-Bera (JB): 108.38
Prob(Q): 0.69 Prob(JB): 0.00
Heteroskedasticity (H): 1.33 Skew: -0.34
Prob(H) (two-sided): 0.14 Kurtosis: 5.79


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • AutoARIMA is not extremely accurate at identifying the order for ARIMA as it fails to identify complex seasonal patterns or long-term trends in the data.
  • Hence, we shall use the acf and pacf to interpret the optimal order used for our models.

ARIMA¶

Walk-Forward Validation + Score Metrics¶

  • Trained on historical data up to a certain point and then used to make one-step predictions using test data, repeat iteratively
  • Model is retrained with the updated historical data and makes predictions for each subsequent time step in the test set
In [51]:
def walk_forward_validate(column, train_set, test_set, model, original_data):
    #store predictions
    predictions=[]
    data=train_set[column].values
    model_selected=model.fit(data)
    
    #making in-sample predictions of training data
    pred_in_sample = model_selected.predict_in_sample()
    trainPredicted = pd.Series(pred_in_sample, index=train_set[column].index)
    
    for t in test_set[column].values:
        #predicting the next time step
        y=model_selected.predict(start=len(data), end=len(data))
        
        #appends one step ahead prediction
        predictions.append(y[0])
        
        #adds current value from test set to data array which include actual value from the history
        #to predict the next iteration prediction
        data = np.append(data, t)
        
        #update ARIMA model with new observation, ensure model uses updated history for next prediction
        model_selected.update(t)
        model_selected=model.fit(data)
        
        #calculate the aic & bic scores based on the order
        aic=model_selected.aic()
        bic=model_selected.bic()
    
    testPredicted = pd.Series(predictions, index=test_set[column].index)
    pred_full = np.concatenate((pred_in_sample, predictions))
    pred_full_Energy = pd.Series(pred_full, index=original_data[column].index)
    
    #calculate the MAPE & RMSE scores for train set and validaion set based on the order
    train_MAPE = round(mean_absolute_percentage_error(train_set[column], trainPredicted)*100,3)
    train_RMSE = round(mean_squared_error(train_set[column], trainPredicted, squared=False),3)
    validate_MAPE = round(mean_absolute_percentage_error(test_set[column], testPredicted)*100, 3)
    validate_RMSE = round(mean_squared_error(test_set[column], testPredicted, squared=False),3)
    
    return(
        trainPredicted, testPredicted, pred_full_Energy, model_selected, aic, bic, 
        train_MAPE, train_RMSE, validate_MAPE, validate_RMSE)

Gas Consumption (Walk Forward Validation)¶

In [52]:
#Gas Consumption Predictions for train and test data
model = ARIMA(order=(0,1,1))
(gas_trainPredicted, gas, pred_full_Gas, arima_modelGas, gas_aic, gas_bic, 
 gas_train_MAPE, gas_train_RMSE, gas_validate_MAPE, gas_validate_RMSE) = walk_forward_validate(
    "Gas Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Gas.head())
DATE
1990-01-01     0.001821
1990-02-01    18.001685
1990-03-01    16.727637
1990-04-01    17.009442
1990-05-01    17.889356
dtype: float64
In [53]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Gas Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Gas Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Gas.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Gas Consumption (tons)", fontsize=16)
ax[0].set_yticks(np.arange(0,51,10))
ax[0].grid(True)

#Focus on the predicted values and the test values
test_set["Gas Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
gas.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].set_yticks(np.arange(0,51,10))
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Gas Consumption, the predicted data seems to match the training and testing data trend even though they are not exactly similar.

Gas Consumption (Model Summary)¶

In [54]:
#print summary for Gas Consumption
arima_modelGas.summary()
Out[54]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(0, 1, 1) Log Likelihood -1068.653
Date: Thu, 10 Aug 2023 AIC 2143.306
Time: 07:27:13 BIC 2155.250
Sample: 0 HQIC 2148.038
- 397
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0341 0.087 0.394 0.694 -0.136 0.204
ma.L1 -0.5347 0.029 -18.449 0.000 -0.592 -0.478
sigma2 12.9165 0.379 34.070 0.000 12.173 13.660
Ljung-Box (L1) (Q): 4.06 Jarque-Bera (JB): 2599.64
Prob(Q): 0.04 Prob(JB): 0.00
Heteroskedasticity (H): 2.24 Skew: 0.82
Prob(H) (two-sided): 0.00 Kurtosis: 15.44


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Moving average model coefficient is not significant as p-value is more than 0.5.
  • Ljung Box test shows that the residuals are not independently distributed.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.

Gas Consumption (Diagnostic Plot)¶

In [55]:
#Plot diagnostic plot
arima_modelGas.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There are no obvious patterns in the residuals in the standardized residual plot.
  • The histogram has a bell-shaped curve and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • However, the correlogram does not show that 95% of the correlations for lag do not lie within the threshold, which shows that this is not a good model.

Electricity Consumption (Walk Forward Validation)¶

  • We shall take a look at electricity consumption without taking seasonality into account by using auto arima proposed order
In [56]:
#Electricity Consumption Predictions for train and test data
model = ARIMA(order=(2,0,5))
(electricity_trainPredicted, electricity, pred_full_Elec, arima_modelElec, elec_aic, elec_bic, 
 elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
    "Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Elec.head())
DATE
1990-01-01    846.326878
1990-02-01    738.124861
1990-03-01    720.508326
1990-04-01    617.486736
1990-05-01    607.414031
dtype: float64
In [57]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)

test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Electricity Consumption, the test and predicted data does not seem to match as seasonality is not being taken into account which means that the order that auto arima came up with does not work with this data, compared to the training data and the predicted data.
In [58]:
#print summary for Electricity Consumption
arima_modelElec.summary()
Out[58]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(2, 0, 5) Log Likelihood -2160.346
Date: Thu, 10 Aug 2023 AIC 4338.692
Time: 07:28:12 BIC 4374.547
Sample: 0 HQIC 4352.895
- 397
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 168.4601 35.767 4.710 0.000 98.358 238.562
ar.L1 1.2754 0.113 11.303 0.000 1.054 1.497
ar.L2 -0.4643 0.105 -4.412 0.000 -0.671 -0.258
ma.L1 -0.0179 0.103 -0.173 0.863 -0.221 0.185
ma.L2 -0.2590 0.051 -5.045 0.000 -0.360 -0.158
ma.L3 -0.0672 0.058 -1.152 0.249 -0.182 0.047
ma.L4 0.2756 0.053 5.195 0.000 0.172 0.380
ma.L5 0.4211 0.063 6.641 0.000 0.297 0.545
sigma2 3095.7020 232.999 13.286 0.000 2639.033 3552.371
Ljung-Box (L1) (Q): 4.42 Jarque-Bera (JB): 0.54
Prob(Q): 0.04 Prob(JB): 0.76
Heteroskedasticity (H): 2.06 Skew: 0.06
Prob(H) (two-sided): 0.00 Kurtosis: 3.14


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Coefficient for L1 & L3 are not significant as p-value is more than 0.5.
  • Ljung Box test shows that the residuals are not independently distributed.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is normally distributed against an alternative of another distribution.
In [59]:
#Plot diagnostic plot
arima_modelElec.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There are obvious patterns in the residuals in the standardized residual plot, implying that there is seasonality occuring in this data.
  • The histogram has a bell-shaped curve and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • However, the correlogram does not show that 95% of the correlations for lag do not lie within the threshold, which shows that this is not a good model.

Water Consumption (Walk Forward Validation)¶

In [60]:
#Water Consumption Predictions for train and test data
model = ARIMA(order=(0,1,1))
(water_trainPredicted, water, pred_full_Water, arima_modelWater, water_aic, water_bic, 
 water_train_MAPE, water_train_RMSE, water_validate_MAPE, water_validate_RMSE) = walk_forward_validate(
    "Water Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Water.head())
DATE
1990-01-01      0.002373
1990-02-01    544.919239
1990-03-01    597.541027
1990-04-01    559.785577
1990-05-01    542.052360
dtype: float64
In [61]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Water Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Water Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Water.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Water Consumption (tons)", fontsize=16)
ax[0].grid(True)

test_set["Water Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
water.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Water Consumption, the predicted data seems to match the training and testing data trend even though they are not exactly similar.

Water Consumption (Model Summary)¶

In [62]:
#print summary for Water Consumption
arima_modelWater.summary()
Out[62]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(0, 1, 1) Log Likelihood -2404.050
Date: Thu, 10 Aug 2023 AIC 4814.100
Time: 07:28:25 BIC 4826.044
Sample: 0 HQIC 4818.832
- 397
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -0.5743 2.335 -0.246 0.806 -5.151 4.002
ma.L1 -0.5644 0.035 -16.018 0.000 -0.633 -0.495
sigma2 1.097e+04 552.167 19.864 0.000 9886.232 1.21e+04
Ljung-Box (L1) (Q): 3.43 Jarque-Bera (JB): 82.06
Prob(Q): 0.06 Prob(JB): 0.00
Heteroskedasticity (H): 1.60 Skew: -0.02
Prob(H) (two-sided): 0.01 Kurtosis: 5.23


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Coefficient for L1 MA coefficient is not significant as p-value is more than 0.5.
  • Ljung Box test shows that the residuals are independently distributed.
  • Heteroscedasticity test shows that the residuals do show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.

Water Consumption (Diagnostic Plot)¶

In [63]:
#Plot diagnostic plot
arima_modelWater.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There are no obvious patterns in the residuals in the standardized residual plot.
  • The histogram has a bell-shaped curve and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • However, the correlogram does not show that 95% of the correlations for lag do not lie exactly within the threshold, which shows that this is not a good model.

All Features (Metrics)¶

In [64]:
print(f'Gas Consumption:')
print(f'Train MAPE: {gas_train_MAPE}%')
print(f'Train RMSE: {gas_train_RMSE}')
print(f'Validation MAPE: {gas_validate_MAPE}%')
print(f'Validation RMSE: {gas_validate_RMSE}')
print()

print(f'Electricity Consumption:')
print(f'Train MAPE: {elec_train_MAPE}%')
print(f'Train RMSE: {elec_train_RMSE}')
print(f'Validation MAPE: {elec_validate_MAPE}%')
print(f'Validation RMSE: {elec_validate_RMSE}')
print()

print(f'Water Consumption:')
print(f'Train MAPE: {water_train_MAPE}%')
print(f'Train RMSE: {water_train_RMSE}')
print(f'Validation MAPE: {water_validate_MAPE}%')
print(f'Validation RMSE: {water_validate_RMSE}')
Gas Consumption:
Train MAPE: 11.742%
Train RMSE: 3.854
Validation MAPE: 9.417%
Validation RMSE: 3.044

Electricity Consumption:
Train MAPE: 4.84%
Train RMSE: 52.197
Validation MAPE: 5.567%
Validation RMSE: 71.416

Water Consumption:
Train MAPE: 22.588%
Train RMSE: 111.539
Validation MAPE: 17.005%
Validation RMSE: 94.14

Observations:

  • Gas Consumption has a good training forecast and a better validation forecast as the training MAPE shows a lower percentage than validation MAPE.
  • Electricity Consumption has a relatively good training forecast but it got worse when forecasting the validation set, implying that this model is not a good fit.
  • Water Consumption has a good training forecast and a better validation forecast as the training MAPE shows a lower percentage than validation MAPE.
  • ALL 3 MODELS CAN BE FURTHER IMPROVED!!!

SARIMA¶

  • Only done for Electricity Consumption & not for Gas Consumption, Water Consumption as there is seasonality present in that data

Electricity Consumption (Walk Forward Validation Using pmdARIMA)¶

  • Clearly, seasonality needs to be accounted for in electricity consumption otherwise forecasting will not be as accurate.
  • This is the difference made when seasonality was taken into account.
In [65]:
#Electricity Consumption Predictions for train and test data
model = ARIMA(order=(2,1,3), seasonal_order=(0,1,1,12))
(electricity_trainPredicted, electricity, pred_full_Elec, arima_modelElec, elec_aic, elec_bic, 
 elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
    "Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Elec.head())
DATE
1990-01-01     -0.052931
1990-02-01    724.996031
1990-03-01    706.578398
1990-04-01    624.434188
1990-05-01    574.661576
dtype: float64
In [66]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA with Seasonality for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)

test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Electricity Consumption, the predicted data seems to match the actual data trend more accurately than without the seasonal component.
In [67]:
#print summary for Electricity Consumption
arima_modelElec.summary()
Out[67]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(2, 1, 3)x(0, 1, [1], 12) Log Likelihood -1769.128
Date: Thu, 10 Aug 2023 AIC 3554.256
Time: 07:32:39 BIC 3585.861
Sample: 0 HQIC 3566.792
- 397
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -0.0732 0.013 -5.458 0.000 -0.099 -0.047
ar.L1 -0.5510 0.089 -6.190 0.000 -0.726 -0.377
ar.L2 0.4136 0.088 4.679 0.000 0.240 0.587
ma.L1 0.1486 0.137 1.081 0.279 -0.121 0.418
ma.L2 -0.9651 0.124 -7.769 0.000 -1.209 -0.722
ma.L3 -0.1822 0.090 -2.034 0.042 -0.358 -0.007
ma.S.L12 -0.7701 0.040 -19.492 0.000 -0.848 -0.693
sigma2 561.0392 64.649 8.678 0.000 434.330 687.749
Ljung-Box (L1) (Q): 0.06 Jarque-Bera (JB): 21.06
Prob(Q): 0.81 Prob(JB): 0.00
Heteroskedasticity (H): 2.92 Skew: -0.09
Prob(H) (two-sided): 0.00 Kurtosis: 4.13


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Coefficient for L1 Moving Average are not significant as p-value is more than 0.5.
  • Ljung Box test shows that the residuals are independently distributed.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.
In [68]:
#Plot diagnostic plot
arima_modelElec.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There is an obvious pattern in the residuals in the standardized residual plot, showing that there is seasonality existing in this data but no obvious trend which suggests that the model can be a good fit.
  • The histogram has a bell-shaped curve and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [69]:
print(f'Electricity Consumption:')
print(f'Train MAPE: {elec_train_MAPE}%')
print(f'Train RMSE: {elec_train_RMSE}')
print(f'Validation MAPE: {elec_validate_MAPE}%')
print(f'Validation RMSE: {elec_validate_RMSE}')
print()
Electricity Consumption:
Train MAPE: 2.499%
Train RMSE: 50.18
Validation MAPE: 2.419%
Validation RMSE: 31.64

Observations:

  • Electricity Consumption has a relatively good training forecast and it improved when forecasting the validation set, implying that this model is a good fit based on the drop in MAPE & RMSE.
  • 2% of MAPE indicates a very small difference between actual & predicted values.
  • However, the difference between train MAPE & validate MAPE is very small so this could improved.

Electricity Consumption (Walk Forward Validation Using statsmodel SARIMA)¶

Now I will try to run the same model but from a different library just to see whether a different result was obtained.

In [70]:
#New walk forward cross validation function since a different library was used which means it has different methods
def walk_forward_validate_sarimax(column, train_set, test_set, model, order, seasonal_order):
    #store predictions
    trainPredictions=[]
    testPredictions=[]
    fullPredictions=[]
    data=train_set[column].values
    
    model_fit = model
    trainPredictions = model_fit.get_prediction(start='1990-01-01', end='2016-05-01').predicted_mean
    for t in test_set[column].values:
        model = SARIMAX(data, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        
        #predicting the next time step
        y_test = model_fit.forecast(steps=1)
        #appends one step ahead prediction
        testPredictions.append(y_test[0])
        
        #adds current value from test set to data array which include actual value from the history
        #to predict the next iteration prediction
        data = np.append(data, t)
    
    #Concatenate full predictions
    fullPredictions=np.concatenate((trainPredictions, testPredictions))
    
    #calculate the aic & bic scores based on the order
    aic=model_fit.aic
    bic=model_fit.bic
    
    #convert everything to dataframe
    trainPredicted = pd.Series(data=trainPredictions, index=train_set.index)
    testPredicted = pd.Series(data=testPredictions, index=test_set.index)
    pred_full_energy = pd.Series(data=fullPredictions, index=energyConsumption_df[column].index)
        
    #calculate the MAPE & RMSE scores for train set and validaion set based on the order
    train_MAPE = round(mean_absolute_percentage_error(train_set[column], trainPredicted)*100,3)
    train_RMSE = round(mean_squared_error(train_set[column], trainPredicted, squared=False),3)
    validate_MAPE = round(mean_absolute_percentage_error(test_set[column], testPredicted)*100, 3)
    validate_RMSE = round(mean_squared_error(test_set[column], testPredicted, squared=False),3)
    
    return(
        trainPredicted, testPredicted, pred_full_energy, aic, bic,
        train_MAPE, train_RMSE, validate_MAPE, validate_RMSE)
In [71]:
#Electricity Consumption Predictions for train and test data
sarimax_model_Elec = SARIMAX(train_set["Electricity Consumption (MWh)"], order=(2,1,3), seasonal_order=(0,1,1,12)).fit()
(electricity_trainPredicted, electricity, pred_full_Elec, elec_aic, elec_bic,
elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE)  = walk_forward_validate_sarimax(
    "Electricity Consumption (MWh)", train_set, test_set, sarimax_model_Elec, (2,1,3), (0,1,1,12)
)
display(pred_full_Elec.head())
DATE
1990-01-01      0.000000
1990-02-01    725.048405
1990-03-01    706.617533
1990-04-01    624.492987
1990-05-01    574.722391
dtype: float64
In [72]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"SARIMA for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)

test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Electricity Consumption, the predicted data seems to match the actual data trend more accurately than without the seasonal component.
In [73]:
print(f'Electricity Consumption:')
print(f'Train MAPE: {elec_train_MAPE}%')
print(f'Train RMSE: {elec_train_RMSE}')
print(f'Validation MAPE: {elec_validate_MAPE}%')
print(f'Validation RMSE: {elec_validate_RMSE}')
print()
Electricity Consumption:
Train MAPE: 2.536%
Train RMSE: 50.23
Validation MAPE: 2.436%
Validation RMSE: 31.608

Observations:

  • Electricity Consumption has a relatively good training forecast and it improved when forecasting the validation set, implying that this model is a good fit based on the drop in MAPE & RMSE.
  • pmdarima library MAPE for the train and validation shows that it is slightly better than statsmodels.
In [74]:
#print SARIMAX summary for Electricity Consumption
sarimax_model_Elec.summary()
Out[74]:
SARIMAX Results
Dep. Variable: Electricity Consumption (MWh) No. Observations: 317
Model: SARIMAX(2, 1, 3)x(0, 1, [1], 12) Log Likelihood -1373.238
Date: Thu, 10 Aug 2023 AIC 2760.476
Time: 07:34:55 BIC 2786.495
Sample: 01-01-1990 HQIC 2770.884
- 05-01-2016
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.7202 0.144 -4.989 0.000 -1.003 -0.437
ar.L2 0.2542 0.122 2.077 0.038 0.014 0.494
ma.L1 0.3331 0.140 2.382 0.017 0.059 0.607
ma.L2 -0.8977 0.067 -13.439 0.000 -1.029 -0.767
ma.L3 -0.2715 0.101 -2.696 0.007 -0.469 -0.074
ma.S.L12 -0.7267 0.050 -14.551 0.000 -0.825 -0.629
sigma2 473.8149 29.857 15.869 0.000 415.296 532.334
Ljung-Box (L1) (Q): 0.04 Jarque-Bera (JB): 48.50
Prob(Q): 0.84 Prob(JB): 0.00
Heteroskedasticity (H): 2.51 Skew: -0.32
Prob(H) (two-sided): 0.00 Kurtosis: 4.85


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • All coefficients are significant as p-value is less than 0.5.
  • Ljung Box test shows that the residuals are independently distributed.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.
In [75]:
#Plot diagnostic plot
fig = sarimax_model_Elec.plot_diagnostics()
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-6,4)
ax.axhline(0)
plt.show()

Observations:

  • There is an obvious pattern in the residuals in the standardized residual plot, showing that there is seasonality existing in this data but no trend so the model can be a good fit.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.

VARMA¶

All Columns¶

  • Fitting all columns into VARMA as Johansen Cointegration Test suggests that all 3 columns have a cointegrated relationship
In [76]:
#Try out p=1, q=1
varma_model = VARMAX(endog=train_set, order=(1,1), enforce_stationary=False, enforce_invertibility=False).fit(disp=False)
#Summary
varma_model.summary()
Out[76]:
Statespace Model Results
Dep. Variable: ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)'] No. Observations: 317
Model: VARMA(1,1) Log Likelihood -4487.343
+ intercept AIC 9028.687
Date: Thu, 10 Aug 2023 BIC 9130.177
Time: 07:34:57 HQIC 9069.227
Sample: 01-01-1990
- 05-01-2016
Covariance Type: opg
Ljung-Box (L1) (Q): 0.83, 2.46, 0.82 Jarque-Bera (JB): 1868.22, 7.44, 75.18
Prob(Q): 0.36, 0.12, 0.36 Prob(JB): 0.00, 0.02, 0.00
Heteroskedasticity (H): 2.11, 2.45, 1.43 Skew: 0.67, 0.35, -0.52
Prob(H) (two-sided): 0.00, 0.00, 0.07 Kurtosis: 14.82, 3.28, 5.15
Results for equation Gas Consumption (tons)
coef std err z P>|z| [0.025 0.975]
intercept 7.9276 2.420 3.275 0.001 3.184 12.672
L1.Gas Consumption (tons) 0.7272 0.075 9.703 0.000 0.580 0.874
L1.Electricity Consumption (MWh) 0.0010 0.001 0.737 0.461 -0.002 0.004
L1.Water Consumption (tons) -0.0051 0.003 -1.839 0.066 -0.010 0.000
L1.e(Gas Consumption (tons)) -0.1175 0.077 -1.532 0.126 -0.268 0.033
L1.e(Electricity Consumption (MWh)) 0.0010 0.005 0.185 0.854 -0.009 0.011
L1.e(Water Consumption (tons)) 0.0090 0.004 2.249 0.025 0.001 0.017
Results for equation Electricity Consumption (MWh)
coef std err z P>|z| [0.025 0.975]
intercept 60.8961 88.367 0.689 0.491 -112.299 234.091
L1.Gas Consumption (tons) 3.1787 2.506 1.268 0.205 -1.734 8.091
L1.Electricity Consumption (MWh) 0.7600 0.072 10.621 0.000 0.620 0.900
L1.Water Consumption (tons) 0.1394 0.095 1.462 0.144 -0.047 0.326
L1.e(Gas Consumption (tons)) -0.3897 2.028 -0.192 0.848 -4.365 3.586
L1.e(Electricity Consumption (MWh)) 0.6118 0.082 7.494 0.000 0.452 0.772
L1.e(Water Consumption (tons)) -0.0808 0.090 -0.901 0.368 -0.257 0.095
Results for equation Water Consumption (tons)
coef std err z P>|z| [0.025 0.975]
intercept 78.2474 59.258 1.320 0.187 -37.895 194.390
L1.Gas Consumption (tons) -1.9881 1.891 -1.051 0.293 -5.695 1.718
L1.Electricity Consumption (MWh) 0.1004 0.043 2.333 0.020 0.016 0.185
L1.Water Consumption (tons) 0.7587 0.082 9.307 0.000 0.599 0.918
L1.e(Gas Consumption (tons)) 3.5529 2.364 1.503 0.133 -1.081 8.187
L1.e(Electricity Consumption (MWh)) -0.1302 0.136 -0.956 0.339 -0.397 0.137
L1.e(Water Consumption (tons)) -0.3076 0.105 -2.934 0.003 -0.513 -0.102
Error covariance matrix
coef std err z P>|z| [0.025 0.975]
sqrt.var.Gas Consumption (tons) 3.6720 0.109 33.842 0.000 3.459 3.885
sqrt.cov.Gas Consumption (tons).Electricity Consumption (MWh) 3.1604 6.860 0.461 0.645 -10.284 16.605
sqrt.var.Electricity Consumption (MWh) 69.9175 4.016 17.408 0.000 62.045 77.790
sqrt.cov.Gas Consumption (tons).Water Consumption (tons) -51.8973 5.403 -9.605 0.000 -62.488 -41.307
sqrt.cov.Electricity Consumption (MWh).Water Consumption (tons) 3.0364 6.911 0.439 0.660 -10.509 16.581
sqrt.var.Water Consumption (tons) 93.9102 3.572 26.293 0.000 86.910 100.911


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Ljung Box test shows that the residuals are independently distributed for all 3 variables.
  • Heteroscedasticity test shows that the residuals do not show variance for gas & electricity consumption but residuals show variances for water consumption.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution for all 3 variables.
In [82]:
#Diagnostics plot for Gas Consumption
fig=varma_model.plot_diagnostics(variable=0)
plt.gcf().suptitle('Gas Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Gas Consumption:

  • There is no obvious pattern in the residuals in the standardized residual plot, suggests that this is a good model.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [97]:
#Diagnostics plot for Electricity Consumption
fig=varma_model.plot_diagnostics(variable=1)
plt.gcf().suptitle('Electricity Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Electricity Consumption:

  • There is an obvious pattern but no trend in the residuals in the standardized residual plot, showing that this can be a good fit.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • However, the correlogram shows that 95% of the correlations for lag do not lie within the threshold, which shows that this is not a good model.
In [96]:
#Diagnostics plot for Water Consumption
fig=varma_model.plot_diagnostics(variable=2)
plt.gcf().suptitle('Water Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Water Consumption:

  • There is no obvious pattern or trend in the residuals in the standardized residual plot, showing that this model can be a good fit.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [78]:
# Walk forward cross validation for VARMA
def walk_forward_validate_varma(columns, train_set, test_set, model, order):
    #store predictions
    trainPredictions=[]
    testPredictions=[]
    fullPredictions=[]
    data=train_set[columns].values
    
    model_fit = model
    trainPredictions = model_fit.get_prediction(start='1990-01-01', end='2016-05-01').predicted_mean
    for t in test_set[columns].values:
        model = VARMAX(endog=data, order=order, enforce_stationary=False, enforce_invertibility=False)
        model_fit = model.fit(disp=False)
        
        #predicting the next time step
        y_test = model_fit.forecast(steps=1)
        #appends one step ahead prediction
        testPredictions.append(y_test[0])
        
        #adds current value from test set to data array which include actual value from the history
        #to predict the next iteration prediction
        data = np.append(data, [t], axis=0)
    
    #Concatenate full predictions
    fullPredictions=np.concatenate((trainPredictions, testPredictions))
    
    #calculate the aic & bic scores based on the order
    aic=model_fit.aic
    bic=model_fit.bic
    
    #convert everything to dataframe
    trainPredicted = pd.DataFrame(data=trainPredictions, columns=columns, index=train_set.index)
    testPredicted = pd.DataFrame(data=testPredictions, columns=columns, index=test_set.index)
    pred_full_energy = pd.DataFrame(data=fullPredictions, columns=columns, index=energyConsumption_df.index)
        
    #calculate the MAPE & RMSE scores for train set and validaion set based on the order
    train_MAPE = round(mean_absolute_percentage_error(train_set[columns], trainPredicted)*100,3)
    train_RMSE = round(mean_squared_error(train_set[columns], trainPredicted, squared=False),3)
    validate_MAPE = round(mean_absolute_percentage_error(test_set[columns], testPredicted)*100, 3)
    validate_RMSE = round(mean_squared_error(test_set[columns], testPredicted, squared=False),3)
    
    return(
        trainPredicted, testPredicted, pred_full_energy, model, aic, bic,
        train_MAPE, train_RMSE, validate_MAPE, validate_RMSE)
In [98]:
#VARMA Predictions for train and test data
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
v_Train_MAPE, v_Train_RMSE, v_Validate_MAPE, v_validate_RMSE)  = walk_forward_validate_varma(
    columns, train_set, test_set, varma_model, (1,1)
)
display(pred_full.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 23.486793 839.552364 480.699104
1990-02-01 19.871191 732.799785 507.893358
1990-03-01 18.466696 720.768855 556.041754
1990-04-01 18.325582 613.990708 509.973274
1990-05-01 19.593443 606.845628 493.557555
In [99]:
fig,ax = plt.subplots(3, 1)
fig.suptitle("VARMA Actual VS Predicted", fontsize=14, fontweight="bold")

#Plot actual & predicted values on the graph
index=0
for column in energyConsumption_df.columns:
    energyConsumption_df[column].plot(ax=ax[index], label="actual", color='dodgerblue')
    pred_full[column].plot(ax=ax[index], label='predicted', linestyle='--', color='orange')
    ax[index].legend()
    ax[index].set_title(f'{column}')
    index+=1
plt.tight_layout()
plt.show()

Observations:

  • For all 3 columns, it seems like the predicted data does not seem to match the actual data.

VARMA (Impulse-Response Function)¶

  • Provides insights into how an impulse to one variable affects both that variable and other variables in the system over multiple time periods by showing how the variables in the system respond to this impulse over time.
  • It captures the lagged effect of the impulse on each variable.
  • For each variable in the system, the IRF plots how the variable's value deviates from the normal path due to impulse.
  • The direction and magnitude of the impact can be observed from the shape of the IRF curve. If the curve rises, it suggests a positive response, and if it falls, it suggests a negative response.
In [109]:
#Impulse-Response Function
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)

gas_irfs = varma_model.impulse_responses(24, impulse=0, orthogonalized=True)
electricity_irfs = varma_model.impulse_responses(24, impulse=1, orthogonalized=True)
water_irfs = varma_model.impulse_responses(24, impulse=2, orthogonalized=True)

#Plot impulse reaction graph for better visualisation
fig, ax = plt.subplots(3, 1, figsize=(15,9))

index=0
irfs=[gas_irfs, electricity_irfs, water_irfs]
for col in columns:
    ax[index].plot(range(0,25), irfs[index], marker='o', label=["Gas","Electricity","Water"])
    ax[index].set_xlabel("Time Steps")
    ax[index].set_ylabel("Impulse-Response Value")
    ax[index].set_title(f"Impulse-Response Function - {col}")
    ax[index].legend()
    ax[index].grid(True)
    index+=1

plt.tight_layout()
plt.show()

Observations:

  • In response to Gas Consumption, Water Consumption has a negative response which shows that Gas Consumption do indeed pose a significant causality to Water Consumption, as supported by the Granger's Causility Test.
  • Electricity Consumption has an autocorrelation with itself over time.
  • Gradually, the IRF converges to zero gradually as the impulse effects became to die out.

VARMA (Durbin Watson Test)¶

Determines whether there is evidence of autocorrelation (serial correlation) in the residuals of a regression analysis. Checks for the presence of first-order autocorrelation. Test statistic take the values between 0 and 4. A value close to 2 suggests no autocorrelation (null hypothesis cannot be rejected). A value significantly less than 2 indicates positive autocorrelation (rejection of null hypothesis in favor of alternative). A value significantly greater than 2 indicates negative autocorrelation (rejection of null hypothesis in favor of alternative)

  • $H_0$: There is no first-order autocorrelation in the residuals.
  • $H_1$: There is first-order autocorrelation in the residuals.
In [110]:
#Conduct durbin watson test
def durbin_watson_test(df, resid=None):
    cols, stat = [], []
    out = durbin_watson(resid)
    for col, val in zip(df.columns, out):
        cols.append(col)
        stat.append(round(val, 2))
    dw_test = pd.DataFrame(stat, index=cols, columns=['Durbin Watson Test Statistic'])
    return dw_test
In [111]:
#Perform durbin watson test
dw_results = durbin_watson_test(df=energyConsumption_df, resid=varma_model.resid)
display(dw_results)
Durbin Watson Test Statistic
Gas Consumption (tons) 2.01
Electricity Consumption (MWh) 1.78
Water Consumption (tons) 2.00

Observations:

  • For Gas Consumption & Water Consumption, Durbin Watson Test Statistic shows that there is no autocorrelation.
  • For Electricity Consumption, Durbin Watson Test Statistic shows that there is a positive autocorrelation.
In [112]:
#Score Metrics
print(f'Energy Consumption Score Metrics:')
print(f'Train MAPE: {v_Train_MAPE}%')
print(f'Train RMSE: {v_Train_RMSE}')
print(f'Validation MAPE: {v_Validate_MAPE}%')
print(f'Validation RMSE: {v_validate_RMSE}')
print()
Energy Consumption Score Metrics:
Train MAPE: 13.028%
Train RMSE: 55.019
Validation MAPE: 11.149%
Validation RMSE: 57.356

Observations:

  • Generally, even though predicted out-sample data are much better predicted than training in-sample data based on the MAPE but it performed worse based on the RMSE.
  • Therefore, there should be a better order for the model that would give a better result.

5) Model Improvement¶

  • Trying out multiple orders for each model to check for best order to use based on score metrics
  • We can interpret the acf and pacf plot for p and q parameter but using my own function which helps us to find the order that gives us the lowest bic and we shall look at the MAPE & RMSE score to ensure that the validation prediction is accurate

ARIMA¶

Gas Consumption¶

In [113]:
scoreMetrics_values={}

#Values of p & q
p_values = np.arange(0, 3)
q_values = np.arange(0, 8)

for i in list(product(p_values,q_values)):
    order=(i[0], 1, i[1])
    model = ARIMA(order=order)
    (gas_trainPredicted, gas, pred_full_Gas, arima_modelGas, gas_aic, gas_bic, 
     gas_train_MAPE, gas_train_RMSE, gas_validate_MAPE, gas_validate_RMSE) = walk_forward_validate(
        "Gas Consumption (tons)", train_set, test_set, model, energyConsumption_df
    )

    #concatenate order as a string
    order_string=f'({i[0]},1,{i[1]})'
    #store the score metrics in the dictionaries
    scoreMetrics_values[order_string] = {'AIC': round(gas_aic,3), 'BIC': round(gas_bic,3), 'Train MAPE': gas_train_MAPE,
                                        'Train RMSE': gas_train_RMSE, 'Validate MAPE': gas_validate_MAPE, 
                                         'Validate RMSE': gas_validate_RMSE}
    
#Convert the dictionary to dataframe
scoreMetrics_Gas_df = pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
    by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"], 
    ascending=[True, True, True,True,True])
display(scoreMetrics_Gas_df)
AIC BIC Train MAPE Train RMSE Validate MAPE Validate RMSE
(1,1,1) 2116.039 2131.965 11.345 3.716 9.344 2.961
(2,1,1) 2116.763 2136.671 11.344 3.708 9.426 2.969
(1,1,2) 2116.935 2136.843 11.329 3.710 9.319 2.968
(0,1,3) 2118.446 2138.353 11.598 3.713 9.605 2.991
(2,1,2) 2118.041 2141.930 11.402 3.703 9.395 2.978
(1,1,3) 2118.526 2142.414 11.403 3.704 9.408 2.992
(0,1,2) 2127.328 2143.254 11.823 3.772 9.496 2.976
(0,1,4) 2119.495 2143.383 11.466 3.706 9.514 3.005
(0,1,5) 2119.048 2146.918 11.352 3.699 9.371 2.979
(2,1,3) 2120.028 2147.897 11.412 3.703 9.424 2.982
(1,1,4) 2121.440 2149.310 11.549 3.705 9.625 3.006
(0,1,6) 2120.609 2152.460 11.414 3.694 9.422 2.995
(2,1,4) 2122.012 2153.863 11.378 3.699 9.434 3.007
(1,1,5) 2122.919 2154.770 11.452 3.701 9.559 3.016
(0,1,1) 2143.306 2155.250 11.742 3.854 9.417 3.044
(2,1,6) 2117.259 2157.073 11.141 3.659 9.428 2.999
(1,1,6) 2121.374 2157.207 11.283 3.690 9.418 2.981
(0,1,7) 2122.605 2158.437 11.415 3.693 9.442 3.003
(2,1,5) 2123.838 2159.670 11.380 3.699 9.431 3.004
(1,1,7) 2122.061 2161.875 11.318 3.680 9.495 3.004
(2,1,7) 2119.095 2162.890 11.157 3.658 9.418 3.006
(2,1,0) 2157.816 2173.741 11.318 3.917 9.233 3.060
(1,1,0) 2166.866 2178.810 11.221 3.958 9.260 3.151
(0,1,0) 2205.936 2213.899 11.105 4.180 9.357 3.192
In [114]:
lowestGasBIC_order = scoreMetrics_Gas_df["BIC"].idxmin()
lowestGasBIC_value = scoreMetrics_Gas_df["BIC"].min()
print(f"Order with Lowest BIC value: {lowestGasBIC_order}")
print(f"Lowest BIC value: {lowestGasBIC_value}")

lowestGasBIC_cols = scoreMetrics_Gas_df.loc[lowestGasBIC_order]
display(lowestGasBIC_cols)
Order with Lowest BIC value: (1,1,1)
Lowest BIC value: 2131.965
AIC              2116.039
BIC              2131.965
Train MAPE         11.345
Train RMSE          3.716
Validate MAPE       9.344
Validate RMSE       2.961
Name: (1,1,1), dtype: float64

Observations:

  • ARIMA order of (1,1,1) has the lowest BIC value so it is the best order to be used for Gas Consumption.
  • Validate MAPE is 2% lower than the train MAPE so it means there is a lower different in errors between the test and predicted data than the train and predicted data.
  • Validate RMSE is also lower than the train RMSE.
In [115]:
Gas_ARIMA_model = ARIMA(order=(1,1,1)).fit(energyConsumption_df["Gas Consumption (tons)"])
Gas_ARIMA_model.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There is no obvious pattern in the residuals in the standardized residual plot, suggests that this is a good model.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [116]:
Gas_ARIMA_model.summary()
Out[116]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(1, 1, 1) Log Likelihood -1054.019
Date: Thu, 10 Aug 2023 AIC 2116.039
Time: 08:13:24 BIC 2131.965
Sample: 01-01-1990 HQIC 2122.348
- 01-01-2023
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0133 0.018 0.728 0.467 -0.023 0.049
ar.L1 0.4336 0.037 11.732 0.000 0.361 0.506
ma.L1 -0.9013 0.032 -28.510 0.000 -0.963 -0.839
sigma2 11.9797 0.429 27.933 0.000 11.139 12.820
Ljung-Box (L1) (Q): 0.22 Jarque-Bera (JB): 2255.67
Prob(Q): 0.64 Prob(JB): 0.00
Heteroskedasticity (H): 2.21 Skew: 0.57
Prob(H) (two-sided): 0.00 Kurtosis: 14.64


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • All coefficients are significant as p-value are less than 0.5.
  • Ljung Box test shows that the residuals are independently distributed.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.
In [135]:
#Gas Consumption Predictions for train and test data
model = ARIMA(order=(1,1,1))
(gas_trainPredicted, gas, pred_full_Gas, arima_modelGas, gas_aic, gas_bic, 
 gas_train_MAPE, gas_train_RMSE, gas_validate_MAPE, gas_validate_RMSE) = walk_forward_validate(
    "Gas Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Gas.head())
DATE
1990-01-01    -0.000792
1990-02-01    17.999123
1990-03-01    16.423710
1990-04-01    17.258491
1990-05-01    18.252177
dtype: float64
In [136]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Gas Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Gas Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Gas.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Gas Consumption (tons)", fontsize=16)
ax[0].set_yticks(np.arange(0,51,10))
ax[0].grid(True)

#Focus on the predicted values and the test values
test_set["Gas Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
gas.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].set_yticks(np.arange(0,51,10))
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Gas Consumption, the predicted data seems to match the training and testing data trend even though they are not exactly similar.

Water Consumption¶

In [119]:
scoreMetrics_values={}

#Values of p & q
p_values = np.arange(0, 4)
q_values = np.arange(0, 5)

for i in list(product(p_values,q_values)):
    order=(i[0], 1, i[1])
    model = ARIMA(order=order)
    (water_trainPredicted, water, pred_full_Water, arima_modelWater, water_aic, water_bic, 
     water_train_MAPE, water_train_RMSE, water_validate_MAPE, water_validate_RMSE) = walk_forward_validate(
        "Water Consumption (tons)", train_set, test_set, model, energyConsumption_df
    )

    #concatenate order as a string
    order_string=f'({i[0]},1,{i[1]})'
    #store the score metrics in the dictionaries
    scoreMetrics_values[order_string] = {'AIC': round(water_aic,3), 'BIC': round(water_bic,3), 
                                         'Train MAPE': water_train_MAPE, 'Train RMSE': water_train_RMSE, 
                                         'Validate MAPE': water_validate_MAPE, 'Validate RMSE': water_validate_RMSE}
    
#Convert the dictionary to dataframe
scoreMetrics_Water_df = pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
    by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"], 
    ascending=[True, True, True,True,True])
display(scoreMetrics_Water_df)
AIC BIC Train MAPE Train RMSE Validate MAPE Validate RMSE
(1,1,1) 4790.987 4806.913 22.076 108.206 17.147 91.563
(1,1,2) 4789.982 4809.889 22.176 108.025 16.839 90.383
(2,1,1) 4790.225 4810.132 22.157 108.043 16.850 90.430
(2,1,2) 4791.924 4815.813 22.174 107.988 16.890 90.617
(1,1,3) 4791.933 4815.822 22.185 108.001 16.879 90.562
(3,1,1) 4792.014 4815.903 22.172 108.025 16.875 90.449
(0,1,3) 4796.227 4816.134 22.110 108.422 17.312 93.050
(0,1,2) 4802.317 4818.243 22.107 109.198 17.670 95.287
(0,1,4) 4794.376 4818.264 22.071 108.164 17.013 91.451
(2,1,3) 4791.966 4819.836 22.175 107.806 16.771 90.067
(1,1,4) 4793.902 4821.772 22.123 108.245 17.087 91.325
(3,1,2) 4793.927 4821.797 22.179 107.987 16.867 90.601
(2,1,4) 4793.793 4825.644 22.225 107.855 16.772 90.078
(0,1,1) 4814.100 4826.044 22.588 111.539 17.005 94.140
(3,1,3) 4795.863 4827.714 21.940 107.465 17.229 92.607
(3,1,4) 4795.948 4831.781 21.751 107.564 16.918 90.987
(3,1,0) 4821.964 4841.871 22.517 112.380 16.675 93.127
(2,1,0) 4829.537 4845.462 22.433 113.995 16.203 92.509
(1,1,0) 4843.533 4855.478 22.763 116.527 16.235 92.506
(0,1,0) 4888.177 4896.140 22.818 123.119 16.501 97.938
In [120]:
lowestWaterBIC_order = scoreMetrics_Water_df["BIC"].idxmin()
lowestWaterBIC_value = scoreMetrics_Water_df["BIC"].min()
print(f"Order with Lowest BIC value: {lowestWaterBIC_order}")
print(f"Lowest BIC value: {lowestWaterBIC_value}")

lowestWaterBIC_cols = scoreMetrics_Water_df.loc[lowestWaterBIC_order]
display(lowestWaterBIC_cols)
Order with Lowest BIC value: (1,1,1)
Lowest BIC value: 4806.913
AIC              4790.987
BIC              4806.913
Train MAPE         22.076
Train RMSE        108.206
Validate MAPE      17.147
Validate RMSE      91.563
Name: (1,1,1), dtype: float64

Observations:

  • ARIMA order of (1,1,1) has the lowest BIC value so it is the best order to be used for Water Consumption.
  • Validate MAPE is 4% lower than the train MAPE so it means there is a lower different in errors between the test and predicted data than the train and predicted data.
  • Validate RMSE is also lower than the train RMSE.
In [121]:
#Diagnostic plot
Water_ARIMA_model = ARIMA(order=(1,1,1)).fit(energyConsumption_df["Water Consumption (tons)"])
Water_ARIMA_model.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There is no obvious pattern in the residuals in the standardized residual plot, suggests that this is a good model.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [122]:
Water_ARIMA_model.summary()
Out[122]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(1, 1, 1) Log Likelihood -2391.494
Date: Thu, 10 Aug 2023 AIC 4790.987
Time: 08:23:35 BIC 4806.913
Sample: 01-01-1990 HQIC 4797.296
- 01-01-2023
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -0.0600 0.395 -0.152 0.879 -0.834 0.714
ar.L1 0.4397 0.045 9.737 0.000 0.351 0.528
ma.L1 -0.9294 0.024 -39.224 0.000 -0.976 -0.883
sigma2 1.028e+04 523.320 19.635 0.000 9249.795 1.13e+04
Ljung-Box (L1) (Q): 0.51 Jarque-Bera (JB): 128.13
Prob(Q): 0.48 Prob(JB): 0.00
Heteroskedasticity (H): 1.50 Skew: -0.30
Prob(H) (two-sided): 0.02 Kurtosis: 5.72


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • All coefficients are significant as p-value are less than 0.5.
  • Ljung Box test shows that the residuals are not independently distributed.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.
In [123]:
#Water Consumption Predictions for train and test data
model = ARIMA(order=(1,1,1))
(water_trainPredicted, water, pred_full_Water, arima_modelWater, water_aic, water_bic, 
 water_train_MAPE, water_train_RMSE, water_validate_MAPE, water_validate_RMSE) = walk_forward_validate(
    "Water Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Water.head())
DATE
1990-01-01      0.207982
1990-02-01    546.711694
1990-03-01    611.921085
1990-04-01    536.728785
1990-05-01    534.288835
dtype: float64
In [124]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Water Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Water Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Water.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Water Consumption (tons)", fontsize=16)
ax[0].grid(True)

test_set["Water Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
water.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Water Consumption, the predicted data seems to match the training and testing data trend even though they are not exactly similar.

SARIMA¶

Electricity Consumption pmdarima¶

In [158]:
scoreMetrics_values={}

#Values of p, q, P & Q
p_values = np.arange(0, 3)
q_values = np.arange(0, 4)
P_values = np.arange(0, 3)
Q_values = np.arange(0, 3)

for i in list(product(p_values, q_values, P_values, Q_values)):
    order = (i[0], 1, i[1])
    seasonal_order = (i[2], 1, i[3], 12)
    model = ARIMA(order=order, seasonal_order=seasonal_order)
    (electricity_trainPredicted, electricity, pred_full_Electricity, arima_modelElectricity, elec_aic, elec_bic,
    elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
        "Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
    )
    
    #concatenate order as a string
    order_string=f'({i[0]},1,{i[1]}), ({i[2]},1,{i[3]},12)'
    
    #store the score metrics in the dictionaries
    scoreMetrics_values[order_string] = {'AIC': round(elec_aic,3), 'BIC': round(elec_bic,3), 
                                         'Train MAPE': elec_train_MAPE, 'Train RMSE': elec_train_RMSE, 
                                         'Validate MAPE': elec_validate_MAPE, 'Validate RMSE': elec_validate_RMSE}
    
#Convert the dictionary to dataframe
scoreMetrics_Electricity_df = pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
    by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"], 
    ascending=[True, True, True,True,True])
display(scoreMetrics_Electricity_df.head(15))
AIC BIC Train MAPE Train RMSE Validate MAPE Validate RMSE
(1,1,1), (2,1,1,12) 3540.218 3567.872 2.486 50.117 2.303 30.474
(1,1,1), (0,1,1,12) 3550.888 3570.641 2.503 50.214 2.394 31.265
(2,1,1), (2,1,1,12) 3540.350 3571.955 2.482 50.101 2.280 30.339
(1,1,2), (2,1,1,12) 3541.214 3572.819 2.485 50.092 2.290 30.391
(0,1,3), (2,1,1,12) 3543.050 3574.655 2.507 50.128 2.281 30.494
(1,1,2), (0,1,1,12) 3551.175 3574.879 2.498 50.185 2.413 31.561
(2,1,1), (0,1,1,12) 3551.651 3575.355 2.503 50.199 2.404 31.449
(1,1,1), (0,1,2,12) 3551.755 3575.459 2.507 50.211 2.375 31.396
(1,1,1), (1,1,1,12) 3552.241 3575.945 2.506 50.212 2.382 31.322
(0,1,2), (2,1,1,12) 3548.394 3576.048 2.538 50.197 2.318 30.942
(1,1,2), (2,1,2,12) 3540.723 3576.279 2.471 50.067 2.324 30.547
(1,1,1), (1,1,2,12) 3550.219 3577.873 2.506 50.186 2.341 31.043
(2,1,2), (2,1,1,12) 3542.989 3578.545 2.476 50.074 2.294 30.325
(2,1,1), (2,1,2,12) 3543.524 3579.079 2.470 50.071 2.322 30.700
(1,1,2), (0,1,2,12) 3552.451 3580.106 2.505 50.185 2.399 31.488
In [159]:
lowestElectricityBIC_order = scoreMetrics_Electricity_df["BIC"].idxmin()
lowestElectricityBIC_value = scoreMetrics_Electricity_df["BIC"].min()
print(f"Order with Lowest BIC value: {lowestElectricityBIC_order}")
print(f"Lowest BIC value: {lowestElectricityBIC_value}")

lowestElectricityBIC_cols = scoreMetrics_Electricity_df.loc[lowestElectricityBIC_order]
display(lowestElectricityBIC_cols)
Order with Lowest BIC value: (1,1,1), (2,1,1,12)
Lowest BIC value: 3567.872
AIC              3540.218
BIC              3567.872
Train MAPE          2.486
Train RMSE         50.117
Validate MAPE       2.303
Validate RMSE      30.474
Name: (1,1,1), (2,1,1,12), dtype: float64

Observations:

  • SARIMA order of (1,1,1), (2,1,1,12) has the lowest BIC value and a low MAPE for Water Consumption.
  • Validate MAPE is 0.18% lower than the train MAPE so it means there is a lower different in errors between the test and predicted data than the train and predicted data.
  • Validate RMSE is also lower than the train RMSE.
In [160]:
scoreMetrics_Electricity_df.sort_values(
    by=["Validate MAPE", "BIC" ,"Train MAPE","Validate RMSE","Train RMSE"], 
    ascending=[True, True, True,True,True])
Out[160]:
AIC BIC Train MAPE Train RMSE Validate MAPE Validate RMSE
(2,1,1), (2,1,1,12) 3540.350 3571.955 2.482 50.101 2.280 30.339
(0,1,3), (2,1,1,12) 3543.050 3574.655 2.507 50.128 2.281 30.494
(2,1,2), (2,1,2,12) 3541.906 3581.413 2.473 50.075 2.281 30.264
(1,1,2), (2,1,1,12) 3541.214 3572.819 2.485 50.092 2.290 30.391
(2,1,2), (2,1,1,12) 3542.989 3578.545 2.476 50.074 2.294 30.325
... ... ... ... ... ... ...
(0,1,1), (0,1,0,12) 3742.753 3754.605 3.042 53.297 3.001 40.805
(1,1,0), (0,1,0,12) 3764.756 3776.608 3.124 53.772 3.016 41.643
(0,1,0), (0,1,0,12) 3780.555 3788.457 3.157 54.117 3.021 42.591
(2,1,0), (1,1,0,12) 3678.675 3698.429 2.827 51.804 3.033 38.875
(0,1,1), (1,1,0,12) 3693.682 3709.485 2.832 52.044 3.104 39.976

108 rows × 6 columns

In [161]:
lowestElectricityMAPE_order = scoreMetrics_Electricity_df["Validate MAPE"].idxmin()
lowestElectricityMAPE_value = scoreMetrics_Electricity_df["Validate MAPE"].min()
print(f"Order with Lowest Validate MAPE value: {lowestElectricityMAPE_order}")
print(f"Lowest Validate MAPE value: {lowestElectricityMAPE_value}")

lowestElectricityMAPE_cols = scoreMetrics_Electricity_df.loc[lowestElectricityMAPE_order]
display(lowestElectricityMAPE_cols)
Order with Lowest Validate MAPE value: (2,1,1), (2,1,1,12)
Lowest Validate MAPE value: 2.28
AIC              3540.350
BIC              3571.955
Train MAPE          2.482
Train RMSE         50.101
Validate MAPE       2.280
Validate RMSE      30.339
Name: (2,1,1), (2,1,1,12), dtype: float64

Observations:

  • Between SARIMA order of (2,1,1), (2,1,1,12) which has the lowest MAPE value and the previous order which has the lowest BIC for Water Consumption, I pick (2,1,1), (2,1,1,12) as the other metrics are relatively lower as well.
  • Validate MAPE is 0.2% lower than the train MAPE so it means there is a lower different in errors between the test and predicted data than the train and predicted data.
  • Validate RMSE is also lower than the train RMSE.
In [162]:
#Electricity Consumption Predictions for train and test data
model = ARIMA(order=(2,1,1), seasonal_order=(2,1,1,12))
(electricity_trainPredicted, electricity, pred_full_Elec, arima_modelElec, elec_aic, elec_bic, 
 elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
    "Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Elec.head())
DATE
1990-01-01     -0.054315
1990-02-01    724.986589
1990-03-01    706.595080
1990-04-01    624.434714
1990-05-01    574.655182
dtype: float64
In [163]:
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))

#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"SARIMA for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)

test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()

Observations:

  • For Electricity Consumption, the predicted data seems to match the actual data trend very accurately.
In [165]:
#print summary for Electricity Consumption
arima_modelElec.summary()
Out[165]:
SARIMAX Results
Dep. Variable: y No. Observations: 397
Model: SARIMAX(2, 1, 1)x(2, 1, 1, 12) Log Likelihood -1762.175
Date: Fri, 11 Aug 2023 AIC 3540.350
Time: 05:46:57 BIC 3571.955
Sample: 0 HQIC 3552.886
- 397
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -0.0413 0.015 -2.840 0.005 -0.070 -0.013
ar.L1 0.5426 0.047 11.472 0.000 0.450 0.635
ar.L2 -0.0619 0.050 -1.244 0.213 -0.160 0.036
ma.L1 -0.9712 0.016 -62.000 0.000 -1.002 -0.941
ar.S.L12 -0.0511 0.061 -0.843 0.399 -0.170 0.068
ar.S.L24 -0.2547 0.064 -4.003 0.000 -0.379 -0.130
ma.S.L12 -0.6733 0.059 -11.358 0.000 -0.789 -0.557
sigma2 544.1684 32.582 16.701 0.000 480.308 608.029
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 23.47
Prob(Q): 0.98 Prob(JB): 0.00
Heteroskedasticity (H): 2.90 Skew: 0.05
Prob(H) (two-sided): 0.00 Kurtosis: 4.21


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • All coefficients are significant as all p-value are less than 0.5.
  • Ljung Box test shows that the residuals are independently distributed, especially with a p-value so close to 1, suggesting it is a good model.
  • Heteroscedasticity test shows that the residuals do not show variance.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution.
In [164]:
#Plot diagnostic plot
arima_modelElec.plot_diagnostics(figsize=(14,10))
plt.show()

Observations:

  • There is no obvious trend in the residuals in the standardized residual plot, showing that this can be a good fit.
  • The histogram is symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.

VARMA¶

All Columns¶

  • As mentioned, I will fit all columns into VARIMA to find the best suitable order as Johansen Test suggests that all variables have a cointergated relationship with one another.
In [126]:
scoreMetrics_values={}

#Values of p & q
p_values = np.arange(0, 5)
q_values = np.arange(0, 5)
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']

for i in list(product(p_values, q_values)):
    order = (i[0], i[1])
    if order!=(0, 0):
        varma_model = VARMAX(endog=train_set[columns], order=order).fit(disp=False)
        (trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
        v_Train_MAPE, v_Train_RMSE, v_Validate_MAPE, v_validate_RMSE)  = walk_forward_validate_varma(
            columns, train_set, test_set, varma_model, order
        )
    
        #concatenate order as a string
        order_string=f'({i[0]},{i[1]})'
    
        #store the score metrics in the dictionaries
        scoreMetrics_values[order_string] = {'AIC': round(aic,3), 'BIC': round(bic,3), 
                                             'Train MAPE': v_Train_MAPE, 'Train RMSE': v_Train_RMSE, 
                                             'Validate MAPE': v_Validate_MAPE, 'Validate RMSE': v_validate_RMSE}
    
#Convert the dictionary to dataframe
scoreMetrics_varma_AllCols=pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
    by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"], 
    ascending=[True, True, True,True,True])
display(scoreMetrics_varma_AllCols)
AIC BIC Train MAPE Train RMSE Validate MAPE Validate RMSE
(4,1) 10965.274 11180.270 12.270 48.256 10.413 48.440
(4,0) 11003.509 11182.673 12.294 49.108 10.400 49.626
(3,1) 11049.429 11228.593 12.476 49.603 10.356 49.713
(3,2) 11016.253 11231.249 12.348 49.110 10.242 48.097
(4,3) 10946.554 11233.216 12.074 46.907 10.283 47.382
(4,2) 10992.567 11243.396 12.237 48.373 10.337 48.372
(4,4) 10957.499 11279.994 12.040 46.528 10.533 48.545
(3,0) 11146.452 11289.783 12.605 51.905 10.724 53.874
(3,3) 11052.662 11303.491 12.249 48.415 10.704 50.333
(2,1) 11255.889 11399.219 13.007 54.374 11.160 55.962
(1,1) 11297.120 11404.618 13.028 55.019 11.149 57.356
(2,0) 11310.818 11418.317 13.183 56.085 11.306 59.190
(2,2) 11267.820 11446.983 13.024 53.845 11.942 58.709
(3,4) 11183.450 11470.112 12.408 49.170 10.630 51.614
(1,3) 11300.195 11479.359 13.161 54.000 10.864 55.258
(2,3) 11272.120 11487.116 12.888 53.503 11.800 57.709
(1,0) 11420.343 11492.009 13.773 60.142 11.173 61.756
(0,3) 11366.295 11509.626 15.563 64.783 11.219 56.821
(1,4) 11296.918 11511.915 13.030 53.255 10.932 55.006
(2,4) 11268.282 11519.111 12.930 53.164 11.748 57.280
(1,2) 11384.797 11528.128 13.018 55.792 11.041 59.102
(0,4) 11374.230 11553.394 15.791 66.254 10.994 58.015
(0,2) 11560.365 11667.863 16.094 67.997 11.479 62.410
(0,1) 11783.506 11855.172 16.706 69.487 12.811 69.277
In [127]:
lowestEnergyBIC_order = scoreMetrics_varma_AllCols["BIC"].idxmin()
lowestEnergyBIC_value = scoreMetrics_varma_AllCols["BIC"].min()
print(f"Order with Lowest BIC value: {lowestEnergyBIC_order}")
print(f"Lowest BIC value: {lowestEnergyBIC_value}")

lowestEnergyBIC_cols = scoreMetrics_varma_AllCols.loc[lowestEnergyBIC_order]
display(lowestEnergyBIC_cols)
Order with Lowest BIC value: (4,1)
Lowest BIC value: 11180.27
AIC              10965.274
BIC              11180.270
Train MAPE          12.270
Train RMSE          48.256
Validate MAPE       10.413
Validate RMSE       48.440
Name: (4,1), dtype: float64

Observations:

  • VARMA order of (4,1) has the lowest BIC value so it is the best order to be used for all these columns.
  • Validate MAPE is 2% lower than the train MAPE so it means there is a lower different in errors between the test and predicted data than the train and predicted data.
  • Validate RMSE is also quite close to the value of the train RMSE.
  • Therefore, there should be a better order for the model that would give a better result if more iterations of order are being placed.
In [152]:
#VARMA Predictions for train and test data
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
varma_model = VARMAX(endog=train_set[columns], order=(4,1)).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
v_Train_MAPE, v_Train_RMSE, v_Validate_MAPE, v_validate_RMSE)  = walk_forward_validate_varma(
    columns, train_set, test_set, varma_model, (4,1)
)
display(pred_full.head())
Gas Consumption (tons) Electricity Consumption (MWh) Water Consumption (tons)
DATE
1990-01-01 23.111158 849.509284 496.233326
1990-02-01 19.728519 754.302206 518.848577
1990-03-01 18.361752 751.997627 566.820054
1990-04-01 18.031029 624.875206 524.352203
1990-05-01 19.291235 614.024792 518.415145
In [153]:
fig,ax = plt.subplots(3, 1)
fig.suptitle("VARMA Actual VS Predicted", fontsize=14, fontweight="bold")

#Plot actual & predicted values on the graph
index=0
for column in energyConsumption_df.columns:
    energyConsumption_df[column].plot(ax=ax[index], label="actual", color='dodgerblue')
    pred_full[column].plot(ax=ax[index], label='predicted', linestyle='--', color='orange')
    ax[index].legend()
    ax[index].set_title(f'{column}')
    index+=1
plt.tight_layout()
plt.show()

Observations:

  • For all 3 columns, it seems like the predicted data does not seem to match the actual data but it looks much better than the graphs fitted by the previous order.
In [154]:
#Impulse-Response Function
varma_model = VARMAX(endog=train_set[columns], order=(4,1)).fit(disp=False)

gas_irfs = varma_model.impulse_responses(24, impulse=0, orthogonalized=True)
electricity_irfs = varma_model.impulse_responses(24, impulse=1, orthogonalized=True)
water_irfs = varma_model.impulse_responses(24, impulse=2, orthogonalized=True)

#Plot impulse reaction graph for better visualisation
fig, ax = plt.subplots(3, 1, figsize=(15,9))

index=0
irfs=[gas_irfs, electricity_irfs, water_irfs]
for col in columns:
    ax[index].plot(range(0,25), irfs[index], marker='o', label=["Gas","Electricity","Water"])
    ax[index].set_xlabel("Time Steps")
    ax[index].set_ylabel("Impulse-Response Value")
    ax[index].set_title(f"Impulse-Response Function - {col}")
    ax[index].legend()
    ax[index].grid(True)
    index+=1

plt.tight_layout()
plt.show()

Observations:

  • In response to Gas Consumption, Water Consumption has a negative response which shows that Gas Consumption do indeed pose a significant causality to Water Consumption, as supported by the Granger's Causility Test.
  • Electricity Consumption has an autocorrelation with itself over time as oscillation tends to be smaller as it converges towards the IRV of 0.
  • Gradually, the IRF converges to zero gradually as the impulse effects became to die out.
  • If a higher iteration is being tested, the model will generally perform better.
In [134]:
#Perform durbin watson test
dw_results = durbin_watson_test(df=energyConsumption_df, resid=varma_model.resid)
display(dw_results)
Durbin Watson Test Statistic
Gas Consumption (tons) 1.99
Electricity Consumption (MWh) 1.94
Water Consumption (tons) 1.98

Observations:

  • For all columns, Durbin Watson Test Statistic shows that there is a positive autocorrelation but it would relatively performed much better if a higher order of p is being tested.
In [155]:
#Diagnostics plot for Gas Consumption
fig=varma_model.plot_diagnostics(variable=0)
plt.gcf().suptitle('Gas Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Gas Consumption:

  • There is no obvious trend in the residuals in the standardized residual plot, suggests that this is a good model.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [156]:
#Diagnostics plot for Electricity Consumption
fig=varma_model.plot_diagnostics(variable=1)
plt.gcf().suptitle('Electricity Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Electricity Consumption:

  • There is an obvious pattern but no trend in the residuals in the standardized residual plot, showing that this can be a good fit.
  • The histogram is symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • However, the correlogram shows that 95% of the correlations for lag do not lie within the threshold, which shows that this is still not a good model but a much better one than order(1,1).
  • With a greater order of p and much higher processing speed, there could be a better fit for the model.
In [157]:
#Diagnostics plot for Water Consumption
fig=varma_model.plot_diagnostics(variable=2)
plt.gcf().suptitle('Water Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Water Consumption:

  • There is no obvious pattern or trend in the residuals in the standardized residual plot, showing that this model can be a good fit.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [7]:
#Try out p=4, q=1
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
varma_model = varma_model = VARMAX(endog=train_set[columns], order=(4,1)).fit(disp=False)
#Summary
varma_model.summary()
Out[7]:
Statespace Model Results
Dep. Variable: ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)'] No. Observations: 317
Model: VARMA(4,1) Log Likelihood -4342.989
+ intercept AIC 8793.977
Date: Fri, 11 Aug 2023 BIC 8996.958
Time: 16:21:02 HQIC 8875.058
Sample: 01-01-1990
- 05-01-2016
Covariance Type: opg
Ljung-Box (L1) (Q): 0.00, 0.10, 0.00 Jarque-Bera (JB): 2048.14, 9.22, 71.69
Prob(Q): 0.98, 0.75, 0.99 Prob(JB): 0.00, 0.01, 0.00
Heteroskedasticity (H): 2.16, 1.84, 1.32 Skew: 0.80, 0.35, -0.40
Prob(H) (two-sided): 0.00, 0.00, 0.15 Kurtosis: 15.35, 3.47, 5.19
Results for equation Gas Consumption (tons)
coef std err z P>|z| [0.025 0.975]
intercept 7.3403 14.838 0.495 0.621 -21.742 36.423
L1.Gas Consumption (tons) 0.5092 1.425 0.357 0.721 -2.284 3.303
L1.Electricity Consumption (MWh) 0.0088 0.013 0.669 0.504 -0.017 0.035
L1.Water Consumption (tons) 0.0008 0.050 0.016 0.987 -0.097 0.098
L2.Gas Consumption (tons) 0.1539 0.974 0.158 0.874 -1.755 2.063
L2.Electricity Consumption (MWh) -0.0147 0.020 -0.726 0.468 -0.054 0.025
L2.Water Consumption (tons) -4.867e-05 0.029 -0.002 0.999 -0.057 0.057
L3.Gas Consumption (tons) -0.0361 0.176 -0.205 0.838 -0.382 0.310
L3.Electricity Consumption (MWh) 0.0125 0.019 0.661 0.509 -0.025 0.050
L3.Water Consumption (tons) -0.0032 0.004 -0.791 0.429 -0.011 0.005
L4.Gas Consumption (tons) 0.0809 0.123 0.659 0.510 -0.160 0.321
L4.Electricity Consumption (MWh) -0.0050 0.008 -0.612 0.541 -0.021 0.011
L4.Water Consumption (tons) -0.0016 0.004 -0.350 0.726 -0.010 0.007
L1.e(Gas Consumption (tons)) 0.0218 1.433 0.015 0.988 -2.786 2.830
L1.e(Electricity Consumption (MWh)) -0.0057 0.015 -0.370 0.712 -0.036 0.025
L1.e(Water Consumption (tons)) 0.0013 0.050 0.026 0.979 -0.097 0.099
Results for equation Electricity Consumption (MWh)
coef std err z P>|z| [0.025 0.975]
intercept 21.7296 159.778 0.136 0.892 -291.429 334.888
L1.Gas Consumption (tons) -0.5529 16.260 -0.034 0.973 -32.422 31.316
L1.Electricity Consumption (MWh) 1.4951 0.151 9.888 0.000 1.199 1.791
L1.Water Consumption (tons) 0.0056 0.534 0.011 0.992 -1.042 1.053
L2.Gas Consumption (tons) 0.5905 11.023 0.054 0.957 -21.014 22.195
L2.Electricity Consumption (MWh) -0.9741 0.253 -3.856 0.000 -1.469 -0.479
L2.Water Consumption (tons) 0.0059 0.319 0.018 0.985 -0.619 0.631
L3.Gas Consumption (tons) -0.1692 1.901 -0.089 0.929 -3.895 3.556
L3.Electricity Consumption (MWh) 0.0475 0.235 0.202 0.840 -0.413 0.508
L3.Water Consumption (tons) -0.0100 0.057 -0.176 0.860 -0.122 0.102
L4.Gas Consumption (tons) 0.0841 1.388 0.061 0.952 -2.636 2.804
L4.Electricity Consumption (MWh) 0.4101 0.096 4.290 0.000 0.223 0.597
L4.Water Consumption (tons) -0.0065 0.062 -0.105 0.916 -0.128 0.115
L1.e(Gas Consumption (tons)) 0.3875 16.352 0.024 0.981 -31.661 32.436
L1.e(Electricity Consumption (MWh)) -0.5081 0.159 -3.192 0.001 -0.820 -0.196
L1.e(Water Consumption (tons)) -0.0293 0.529 -0.055 0.956 -1.065 1.007
Results for equation Water Consumption (tons)
coef std err z P>|z| [0.025 0.975]
intercept 82.7297 400.688 0.206 0.836 -702.605 868.064
L1.Gas Consumption (tons) 4.4434 39.346 0.113 0.910 -72.673 81.560
L1.Electricity Consumption (MWh) -0.0783 0.319 -0.245 0.806 -0.704 0.547
L1.Water Consumption (tons) 0.5642 1.303 0.433 0.665 -1.989 3.118
L2.Gas Consumption (tons) -3.2983 26.945 -0.122 0.903 -56.110 49.513
L2.Electricity Consumption (MWh) 0.1979 0.490 0.404 0.686 -0.762 1.158
L2.Water Consumption (tons) 0.0278 0.769 0.036 0.971 -1.479 1.535
L3.Gas Consumption (tons) -0.8847 4.561 -0.194 0.846 -9.823 8.054
L3.Electricity Consumption (MWh) -0.0959 0.440 -0.218 0.828 -0.959 0.767
L3.Water Consumption (tons) 0.0506 0.107 0.471 0.638 -0.160 0.261
L4.Gas Consumption (tons) -1.8767 3.625 -0.518 0.605 -8.982 5.229
L4.Electricity Consumption (MWh) 0.0890 0.213 0.418 0.676 -0.329 0.507
L4.Water Consumption (tons) 0.0731 0.112 0.655 0.512 -0.146 0.292
L1.e(Gas Consumption (tons)) 0.3860 39.455 0.010 0.992 -76.945 77.717
L1.e(Electricity Consumption (MWh)) -0.1048 0.368 -0.285 0.776 -0.827 0.617
L1.e(Water Consumption (tons)) -0.0495 1.311 -0.038 0.970 -2.619 2.520
Error covariance matrix
coef std err z P>|z| [0.025 0.975]
sqrt.var.Gas Consumption (tons) 3.5500 0.127 27.964 0.000 3.301 3.799
sqrt.cov.Gas Consumption (tons).Electricity Consumption (MWh) 2.2208 3.506 0.633 0.526 -4.651 9.093
sqrt.var.Electricity Consumption (MWh) 41.8941 2.223 18.846 0.000 37.537 46.251
sqrt.cov.Gas Consumption (tons).Water Consumption (tons) -46.5015 5.396 -8.618 0.000 -57.078 -35.925
sqrt.cov.Electricity Consumption (MWh).Water Consumption (tons) 8.2834 7.392 1.121 0.262 -6.204 22.771
sqrt.var.Water Consumption (tons) 92.6488 4.063 22.802 0.000 84.685 100.612


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Ljung Box test shows that the residuals are independently distributed for all 3 variables.
  • Heteroscedasticity test shows that the residuals do not show variance for gas & electricity consumption but residuals show variances for water consumption, suggesting that additional transformation might be needed.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution for all 3 variables.

Gas Consumption & Water Consumption¶

  • As shown in Granger's Causality Test & Engle Granger Cointegration Test, I will fit Gas Consumption & Water Consumption into VARIMA to find the best suitable order as there is a causlity and cointegration relationship between these two variables.
In [137]:
scoreMetrics_values={}

#Values of p & q
p_values = np.arange(0, 5)
q_values = np.arange(0, 5)
columns = ['Gas Consumption (tons)', 'Water Consumption (tons)']

for i in list(product(p_values, q_values)):
    order = (i[0], i[1])
    if order!=(0, 0):
        varma_model = VARMAX(endog=train_set[columns], order=order).fit(disp=False)
        (trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
        va_Train_MAPE, va_Train_RMSE, va_Validate_MAPE, va_validate_RMSE)  = walk_forward_validate_varma(
            columns, train_set, test_set, varma_model, order
        )
    
        #concatenate order as a string
        order_string=f'({i[0]},{i[1]})'
    
        #store the score metrics in the dictionaries
        scoreMetrics_values[order_string] = {'AIC': round(aic,3), 'BIC': round(bic,3), 
                                             'Train MAPE': va_Train_MAPE, 'Train RMSE': va_Train_RMSE, 
                                             'Validate MAPE': va_Validate_MAPE, 'Validate RMSE': va_validate_RMSE}
    
#Convert the dictionary to dataframe
scoreMetrics_varma_GasWater=pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
    by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"], 
    ascending=[True, True, True,True,True])
display(scoreMetrics_varma_GasWater)
AIC BIC Train MAPE Train RMSE Validate MAPE Validate RMSE
(1,1) 6855.624 6907.383 16.783 53.829 12.845 45.666
(2,0) 6857.742 6909.500 16.817 54.298 12.710 45.282
(1,0) 6874.542 6910.374 17.203 55.770 12.596 45.357
(2,1) 6859.265 6926.949 16.828 53.734 12.828 46.545
(3,0) 6860.386 6928.070 16.725 53.806 12.880 46.076
(1,2) 6863.627 6931.311 16.742 53.791 12.808 46.115
(4,0) 6861.252 6944.861 16.620 53.434 12.930 46.470
(2,2) 6863.413 6947.022 16.739 53.331 13.028 46.579
(3,1) 6865.542 6949.151 16.733 53.552 12.819 46.485
(1,3) 6872.424 6956.034 16.844 54.029 12.777 46.091
(3,2) 6865.803 6965.338 16.598 53.092 13.066 47.259
(2,3) 6866.049 6965.584 16.654 53.140 13.069 47.070
(4,1) 6867.627 6967.162 16.656 53.170 13.037 46.824
(1,4) 6882.055 6981.590 17.093 54.257 12.828 45.622
(3,3) 6872.502 6987.963 16.620 52.921 13.322 47.517
(4,2) 6873.994 6989.455 16.573 53.005 13.087 46.716
(0,4) 6906.697 6990.306 18.961 57.230 12.905 45.286
(2,4) 6875.056 6990.517 16.750 53.219 13.235 47.196
(0,3) 6924.961 6992.645 18.779 57.246 13.354 45.689
(0,2) 6954.793 7006.551 19.101 58.304 13.359 46.082
(4,3) 6878.091 7009.478 16.521 52.774 13.217 47.405
(3,4) 6878.726 7010.113 16.612 52.900 13.310 47.785
(4,4) 6885.324 7032.636 16.426 52.675 13.310 47.759
(0,1) 7049.810 7085.643 20.823 61.470 14.862 48.663
In [138]:
lowestGasWaterBIC_order = scoreMetrics_varma_GasWater["BIC"].idxmin()
lowestGasWaterBIC_value = scoreMetrics_varma_GasWater["BIC"].min()
print(f"Order with Lowest BIC value: {lowestGasWaterBIC_order}")
print(f"Lowest BIC value: {lowestGasWaterBIC_value}")

lowestGasWaterBIC_cols = scoreMetrics_varma_GasWater.loc[lowestGasWaterBIC_order]
display(lowestGasWaterBIC_cols)
Order with Lowest BIC value: (1,1)
Lowest BIC value: 6907.383
AIC              6855.624
BIC              6907.383
Train MAPE         16.783
Train RMSE         53.829
Validate MAPE      12.845
Validate RMSE      45.666
Name: (1,1), dtype: float64

Observations:

  • VARMA order of (1,1) has the lowest BIC value so it is the best order to be used for Gas Consumption & Water Consumption columns.
  • Validate MAPE is 4% lower than the train MAPE so it means there is a lower different in errors between the test and predicted data than the train and predicted data.
  • Validate RMSE is also much lower than the value of the train RMSE.
  • Therefore, this is quite a good model for these features.
In [139]:
#VARMA Predictions for train and test data
columns = ['Gas Consumption (tons)', 'Water Consumption (tons)']
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
va_Train_MAPE, va_Train_RMSE, va_Validate_MAPE, va_validate_RMSE)  = walk_forward_validate_varma(
    columns, train_set, test_set, varma_model, (1,1)
)
display(pred_full.head())
Gas Consumption (tons) Water Consumption (tons)
DATE
1990-01-01 23.160686 489.703408
1990-02-01 19.559033 528.456274
1990-03-01 18.258346 577.290008
1990-04-01 18.045345 535.884574
1990-05-01 19.383292 526.591122
In [143]:
fig,ax = plt.subplots(2, 1)
fig.suptitle("VARMA Actual VS Predicted", fontsize=14, fontweight="bold")

#Plot actual & predicted values on the graph
index=0
for column in columns:
    energyConsumption_df[column].plot(ax=ax[index], label="actual", color='dodgerblue')
    pred_full[column].plot(ax=ax[index], label='predicted', linestyle='--', color='orange')
    ax[index].legend()
    ax[index].set_title(f'{column}')
    index+=1
plt.tight_layout()
plt.show()

Observations:

  • For these 2 columns, it seems like the predicted data seems to be a good estimate of match the actual data.
In [148]:
#Impulse-Response Function
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)

gas_irfs = varma_model.impulse_responses(24, impulse=0, orthogonalized=True)
water_irfs = varma_model.impulse_responses(24, impulse=1, orthogonalized=True)

#Plot impulse reaction graph for better visualisation
fig, ax = plt.subplots(2, 1, figsize=(15,9))

index=0
irfs=[gas_irfs, water_irfs]
for col in columns:
    ax[index].plot(range(0,25), irfs[index], marker='o', label=["Gas", "Water"])
    ax[index].set_xlabel("Time Steps")
    ax[index].set_ylabel("Impulse-Response Value")
    ax[index].set_title(f"Impulse-Response Function - {col}")
    ax[index].legend()
    ax[index].grid(True)
    index+=1

plt.tight_layout()
plt.show()

Observations:

  • In response to Gas Consumption, Water Consumption has a negative response which shows that Gas Consumption do indeed pose a significant causality to Water Consumption, as supported by the Granger's Causility Test.
  • Gradually, the IRF converges to zero gradually as the impulse effects became to die out.
In [142]:
#Perform durbin watson test
dw_results = durbin_watson_test(df=energyConsumption_df[columns], resid=varma_model.resid)
display(dw_results)
Durbin Watson Test Statistic
Gas Consumption (tons) 2.06
Water Consumption (tons) 1.92

Observations:

  • For Gas Consumption, Durbin Watson Test Statistic shows that there is a negative autocorrelation as test statistic is larger than 2 while there is a positive autocorrelation for Water Consumption as test statistic is lesser than 2.
In [149]:
#Diagnostics plot for Gas Consumption
fig=varma_model.plot_diagnostics(variable=0)
plt.gcf().suptitle('Gas Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Gas Consumption:

  • There is no obvious trend in the residuals in the standardized residual plot, suggests that this is a good model.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [150]:
#Diagnostics plot for Water Consumption
fig=varma_model.plot_diagnostics(variable=1)
plt.gcf().suptitle('Water Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()

Observations for Water Consumption:

  • There is no obvious pattern or trend in the residuals in the standardized residual plot, showing that this model can be a good fit.
  • The histogram is slightly symmetrical and the QQ plot shows all points lying on the line which means that the data follows a normal distribution.
  • The correlogram shows that 95% of the correlations for lag do lie within the threshold, which shows that this is a good model.
In [6]:
#Try out p=1, q=1
columns = ['Gas Consumption (tons)', 'Water Consumption (tons)']
varma_model = varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
#Summary
varma_model.summary()
Out[6]:
Statespace Model Results
Dep. Variable: ['Gas Consumption (tons)', 'Water Consumption (tons)'] No. Observations: 317
Model: VARMA(1,1) Log Likelihood -2744.862
+ intercept AIC 5515.725
Date: Fri, 11 Aug 2023 BIC 5564.591
Time: 16:19:32 HQIC 5535.244
Sample: 01-01-1990
- 05-01-2016
Covariance Type: opg
Ljung-Box (L1) (Q): 0.43, 1.54 Jarque-Bera (JB): 2157.75, 59.97
Prob(Q): 0.51, 0.21 Prob(JB): 0.00, 0.00
Heteroskedasticity (H): 2.12, 1.39 Skew: 0.79, -0.41
Prob(H) (two-sided): 0.00, 0.09 Kurtosis: 15.68, 4.97
Results for equation Gas Consumption (tons)
coef std err z P>|z| [0.025 0.975]
intercept 7.9184 2.261 3.503 0.000 3.487 12.349
L1.Gas Consumption (tons) 0.7724 0.067 11.583 0.000 0.642 0.903
L1.Water Consumption (tons) -0.0054 0.002 -2.267 0.023 -0.010 -0.001
L1.e(Gas Consumption (tons)) -0.1743 0.069 -2.545 0.011 -0.309 -0.040
L1.e(Water Consumption (tons)) 0.0097 0.004 2.746 0.006 0.003 0.017
Results for equation Water Consumption (tons)
coef std err z P>|z| [0.025 0.975]
intercept 133.9090 58.055 2.307 0.021 20.123 247.695
L1.Gas Consumption (tons) -1.5998 1.601 -0.999 0.318 -4.738 1.539
L1.Water Consumption (tons) 0.8022 0.065 12.250 0.000 0.674 0.931
L1.e(Gas Consumption (tons)) 2.9725 1.916 1.552 0.121 -0.782 6.727
L1.e(Water Consumption (tons)) -0.3416 0.092 -3.708 0.000 -0.522 -0.161
Error covariance matrix
coef std err z P>|z| [0.025 0.975]
sqrt.var.Gas Consumption (tons) 3.6825 0.096 38.265 0.000 3.494 3.871
sqrt.cov.Gas Consumption (tons).Water Consumption (tons) -51.7898 5.275 -9.817 0.000 -62.129 -41.450
sqrt.var.Water Consumption (tons) 95.0667 3.322 28.615 0.000 88.555 101.578


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • Ljung Box test shows that the residuals are independently distributed for all 2 variables.
  • Heteroscedasticity test shows that the residuals do not show variance for gas consumption but residuals show variances for water consumption.
  • Jarque-Bera Test shows the data is not normally distributed against an alternative of another distribution for all 3 variables.

6) Summary¶

In summary, SARIMA works better for Electricity Consumption which clearly show seasonality in the data. ARIMA tends to perform better for Gas Consumption & Water Consumption which do not show seasonality in the data. VARMA work well when there is causality and cointegration between multiple variables but it will require additional data transformation as shown. Model improvement to find the most suitable order for each data will require more time and higher processing speed as larger number of iterations will be required to find the best order. Vector Error Correction Models (VECM) could be run if there was more time for variables that show long term relationship and no short term relationship between one another based on the Engle-Granger Cointegration Test results & Granger Causality Test.